Acquiring NCAA Softball Stats with R

A nice instructional piece was just posted to The Hardball Times website by Bill Petti on how to acquire NCAA baseball data. I’ve been asked many times whether the NCAA provides a database of softball data, and my response is always “unfortunately, no”. By following Bill’s outline, his instructions can be modified to acquire NCAA softball data as well.

Here’s the link to the article: Research Notebook: Acquiring NCAA Baseball Stats with R.

For other data collection tools for use with women’s softball, see new-softball-research-tools-released-ncaa-softball.

If you don’t know how to use R and/or want to learn more about sabermetrics, there is a course that combines the two titled Sabermetrics 101.

Revisited: Do power-heavy or speed-heavy teams do better in the postseason?

By Matt Meuchel

I have revisited the concept I investigated in a previous article about power-heavy and speed-heavy teams. Last time I examined it with 2 years of data (2013 and 2014). Now that I have 4 seasons of data, I wanted to check back in on it. This time I also added the 2 other parts of the equation that I didn’t check before.

So here is a refresher on the methodology: I looked at 2 statistics to determine whether a team was an All Power Team or an All Speed Team: Isolated Power (ISO) and Stolen Bases per Game (SB/G). For each of the 4 seasons I identified all teams that were 1 and 2 standard deviations above and below the mean in both statistics. From this I could find the teams that fit into 4 different categories:
1) Teams with Above Average Power and Above Average Speed
2) Teams with Below Average Power and Below Average Speed
3) Teams with Above Average Power and Below Average Speed (Termed “All Power Teams”)
4) Teams with Below Average Power and Above Average Speed (Termed “All Speed Teams”)
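
The four categories can be sketched in code. The following is a minimal Python illustration, not the actual workflow used for this article: the team names and numbers are made up, the `categorize` function and its `threshold` parameter are hypothetical, and it assumes a team counts as above or below average once it sits at least 1 standard deviation from the mean.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Return how many standard deviations each value sits from the mean."""
    center = mean(values)
    spread = pstdev(values)
    return [(v - center) / spread for v in values]


def categorize(teams, threshold=1.0):
    """Label each (name, iso, sb_per_game) tuple with one of the 4 categories.

    Teams closer than `threshold` standard deviations to the mean in either
    statistic get no label (assumption: only clear outliers qualify).
    """
    iso_z = z_scores([iso for _, iso, _ in teams])
    sb_z = z_scores([sb for _, _, sb in teams])
    labels = {}
    for (name, _, _), power, speed in zip(teams, iso_z, sb_z):
        if power >= threshold and speed >= threshold:
            labels[name] = "Above Average Power and Above Average Speed"
        elif power <= -threshold and speed <= -threshold:
            labels[name] = "Below Average Power and Below Average Speed"
        elif power >= threshold and speed <= -threshold:
            labels[name] = "All Power Team"
        elif power <= -threshold and speed >= threshold:
            labels[name] = "All Speed Team"
    return labels


# Hypothetical (team, ISO, SB/G) values, for illustration only.
sample_teams = [
    ("Team A", 0.250, 2.0),
    ("Team B", 0.250, 0.4),
    ("Team C", 0.050, 2.0),
    ("Team D", 0.050, 0.4),
    ("Team E", 0.150, 1.2),
    ("Team F", 0.150, 1.2),
]

for team, label in categorize(sample_teams).items():
    print(team, "->", label)
```

Raising `threshold` to 2.0 would keep only the teams that are 2 standard deviations out, at the cost of a much smaller group.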

Here are the statistics associated with these categories:

1) There were 31 teams in 4 seasons that qualified as Teams with Above Average Power and Above Average Speed. Of those 31 teams 11 (35.5%) did not make the post season and 20 (64.5%) made the post season. Of those that made the post season 11 were Regional Teams (35.5% of total), 3 were Super Regional Teams (9.7% of total), and 6 were WCWS Teams (19.4% of total).

2) There were 40 teams in 4 seasons that qualified as Teams with Below Average Power and Below Average Speed. Of those 40 teams 40 (100%) did not make the post season and 0 (0%) made the post season.

3) There were 22 teams in 4 seasons that qualified as “All Power Teams”. Of those 22 teams 14 (63.6%) did not make the post season and 8 (36.4%) did make the post season. Of those that made the post season 6 were Regional Teams (27.3% of total), 1 was a Super Regional Team (4.5% of total), and 1 was a WCWS Team (4.5% of total).

4) There were 30 teams in 4 seasons that qualified as “All Speed Teams”. Of those 30 teams 27 (90%) did not make the post season and 3 (10%) did make the post season. Of those that made the post season all 3 of them were Regional Teams (10% of total) and none were Super Regional or WCWS teams.

In conclusion, of course teams want to fit into Category 1, with above average power and speed, and no one wants to be in Category 2, with below average power and speed. Even though these are very intuitive thoughts, I wanted to put the stats to these two categories. Categories 3 and 4 were the categories I examined before and wanted to update. Both have become less favorable toward post season play with 2 more seasons of data. Category 4 was not favorable after 2 seasons and is even less so after 4 seasons. The funny thing about Category 4 is that it is relatively stable, both in the number of teams that qualify for it (7, 8, 7, and 8 teams in the 4 individual years) and in the number that make the post season (1 team in each of 2013, 2014, and 2015, but none in 2016). Category 3 had more favorable post season stats after 2 seasons (when 50% qualified for the post season) than after 4 seasons, which was interesting. Even given that regression, Category 3 teams qualify for the post season at more than 3 times the rate of Category 4 teams (36.4% versus 10%).
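
For anyone who wants to check the percentages, here is a short Python sketch that recomputes the post season rate for each category from the team counts listed above. The dictionary layout and variable names are just one way to arrange the numbers.

```python
# (total teams, teams that made the post season), taken from the summaries above.
categories = {
    "Above Average Power and Speed": (31, 20),
    "Below Average Power and Speed": (40, 0),
    "All Power Teams": (22, 8),
    "All Speed Teams": (30, 3),
}

# Share of each category that reached the post season.
rates = {name: made / total for name, (total, made) in categories.items()}
for name, rate in rates.items():
    print(f"{name}: {rate:.1%} made the post season")

# How many times more often All Power Teams qualified than All Speed Teams.
power_vs_speed = rates["All Power Teams"] / rates["All Speed Teams"]
print(f"All Power rate is {power_vs_speed:.2f} times the All Speed rate")
```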

Christopher Long’s D-I Top 20

Let’s start out by saying that the number-one team in D-I softball for 2016 was not who you think it was.

Following up on the D-III top 20 and the D-II top 20 as provided by Detroit Tigers analyst Christopher Long, here are his season-ending top 20 teams in D-I softball. Long’s rankings take into account the impact of a team’s home field, their offensive and defensive strength, and their strength of schedule. And yes, there’s a surprise at #1.

Rank     School         Overall Strength  Home Park  Offensive Strength  Defensive Strength  SOS
1        Florida        7.887             0.956      2.198               0.279               1.478
2        Oklahoma       6.036             1.061      2.297               0.381               1.410
2 (tie)  Oregon         6.036             0.973      2.579               0.427               1.397
4        Michigan       6.029             0.993      2.465               0.409               1.352
5        Auburn         5.857             1.018      2.618               0.447               1.434
6        Alabama        5.159             1.000      2.218               0.43                1.468
7        Florida St.    5.029             0.984      2.25                0.447               1.384
8        Washington     4.934             0.971      2.719               0.551               1.515
9        UL Lafayette   4.706             0.998      2.246               0.477               1.241
10       Georgia        4.566             1.012      2.116               0.463               1.417
11       Missouri       4.449             0.968      2.391               0.537               1.447
12       LSU            4.336             0.907      2.060               0.475               1.478
13       Tennessee      4.321             0.993      2.231               0.516               1.406
14       Texas A&M      3.938             0.947      2.530               0.642               1.474
15       UCLA           3.861             0.992      2.289               0.593               1.531
16       Minnesota      3.754             1.021      2.017               0.537               1.294
17       James Mad.     3.752             1.054      1.485               0.396               1.149
18       Kentucky       3.648             1.051      1.531               0.420               1.351
19       Arizona        3.420             1.188      1.672               0.489               1.457
20       Utah           3.391             1.205      1.596               0.471               1.438

Yes, of course Oklahoma won the national championship. But for their body of work over the course of the season, according to Long’s calculations Florida was clearly the best team in the country.

The teams from the Women’s College World Series were well represented in the rankings, with the final 8 teams all ranked in the top 15.

And not to beat a dead horse, but what happened to #2 Oregon? Their ouster at home by UCLA in the Super Regionals is even more surprising seeing these rankings.

Christopher Long’s D-II Top 20

Following up on the D-III top 20 as provided by Detroit Tigers analyst Christopher Long, here are his season-ending top 20 teams in D-II softball. Long’s rankings take into account the impact of a team’s home field, their offensive and defensive strength, and their strength of schedule.

Rank  School         Overall Strength  Home Park  Offensive Strength  Defensive Strength  SOS
1     N. Alabama     2.761             0.980      2.135               0.432               1.005
2     N. Georgia     2.579             0.933      1.787               0.387               0.971
3     Saint Leo      2.412             1.003      1.516               0.352               0.938
4     Humbldt St.    2.340             1.145      1.700               0.406               0.967
5     W. Tx. A&M     2.156             1.152      1.938               0.503               0.904
6     Valdosta St.   2.088             0.997      1.856               0.497               0.982
7     S. Arkansas    2.065             0.943      1.677               0.454               0.938
8     Armstrng St    1.870             0.967      1.641               0.491               0.999
9     U. Indy        1.820             0.983      1.580               0.486               0.83
10    Mo.-St. Lou.   1.807             0.977      1.322               0.409               0.836
11    Ark. Tech      1.806             0.905      1.509               0.467               0.935
12    WV Wesl.       1.682             1.010      1.378               0.458               0.81
13    Ala.-Hunts.    1.611             1.016      1.689               0.587               0.993
14    W. Florida     1.54              0.956      1.597               0.580               0.986
15    Cal. Baptist   1.507             0.97       1.438               0.533               0.871
16    Rollins        1.505             0.951      1.378               0.512               0.904
17    Wayne St.      1.496             1.045      1.27                0.475               0.793
18    Azusa Pac.     1.457             1.022      1.556               0.597               0.898
19    Chico St.      1.444             0.960      1.452               0.562               0.937
20    Georgia C.     1.439             0.950      1.724               0.670               0.901

Christopher Long’s D-III Top 20

Christopher Long, currently of the Detroit Tigers, has the ability to analyze almost any sport. This includes softball. Best of all he shares the code he uses and his findings at his GitHub site https://github.com/octonion. This is the same Christopher Long who makes an appearance in the book The Only Rule Is It Has To Work, which is a great read on the benefits and challenges of applying analytics to minor-league baseball. But I digress.

I just noticed that Christopher has updated his year-end softball rankings, which take into account the impact of a team’s home field, their offensive and defensive strength, and their strength of schedule.

Here are his rankings for D-III.

Rank  School         Overall Strength  Home Park  Offensive Strength  Defensive Strength  SOS
1     Texas-Tyler    2.239             0.987      2.285               0.245               0.801
2     CMS            1.353             0.916      1.830               0.325               0.776
3     Salisbury      1.312             0.96       2.122               0.389               0.664
4     Berry          1.256             0.992      2.18                0.417               0.735
5     E. Tx. Bapt.   1.244             0.903      2.411               0.466               0.791
6     Va. Wslyn.     1.241             0.969      1.82                0.353               0.714
7     Emory          1.215             0.995      2.26                0.447               0.75
8     Rowan          1.201             1.011      2.159               0.432               0.708
9     Texas Lu.      1.192             0.926      2.167               0.437               0.667
10    Linfield       1.171             0.959      2.204               0.453               0.856
11    Luther         1.139             1.03       1.71                0.361               0.653
12    Birm.-So.      1.063             0.919      1.979               0.448               0.753
13    St. Thomas     1.047             0.894      1.693               0.389               0.705
14    Trine          0.951             1.065      1.813               0.459               0.747
15    Messiah        0.930             1.016      1.847               0.477               0.689
16    Pacific (OR)   0.924             0.96       2.001               0.521               0.856
17    Chris. Newp.   0.923             1.034      2.029               0.528               0.734
18    Whitworth      0.904             0.911      1.763               0.469               0.853
19    George Fox     0.879             1.032      1.708               0.467               0.841
20    La Verne       0.873             0.991      1.754               0.483               0.735

Since I am an assistant coach at Claremont-Mudd-Scripps Colleges, of course it is nice to see that we’re ranked #2. But what I really can’t help but notice is the strength of the West Region. In all, a remarkable 9 of the top 20 teams play in the West Region, and just 7 of these teams made the playoffs. Left out of the mix were #16 Pacific (25-16-1 overall) and #20 La Verne (31-11). Because the NCAA’s priority for D-III softball is saving money by flying as few teams as possible to Regional locations, not only were two teams left out of the post-season but the remaining 7 teams were fighting for just one spot in the eight-team World Series. Not surprisingly, that one team from the West Region, Texas-Tyler, won the national championship.

To learn more about Christopher Long and his work, follow him on Twitter at @octonion or see his blog at http://angrystatistician.blogspot.com.

Infographic: 2016 WCWS

WCWS 2016

It turns out that a comment by Adelphi associate head coach Ophir Sadeh is right: women’s college softball is a perfect fit for television. This year’s D-I Women’s College World Series featured competitive games played within a reasonable amount of time. In fact, 6 of the 15 WCWS games were decided by just 1 run and 9 games were decided by 2 runs or fewer. There were a record 78,072 fans in attendance in Oklahoma City, and the title game between Oklahoma and Auburn easily topped the cable sports TV ratings for June 8.

Regarding Strikeouts, Softball is Becoming a Contact Sport

A trend of fewer strikeouts in D-I women’s softball continued for a sixth season in 2016. Strikeouts dropped to 4.59 per 7 innings pitched, the lowest rate since 2001.

Here are strikeouts per 7 innings pitched since 1982.

Strikeouts Scoring D-I

Strikeouts bottomed out in the heart of the small-ball era, reaching their low point in 1987 at just 2.7 strikeouts per 7 innings pitched. Strikeouts peaked in 2010 at 5.48 per 7 innings pitched.

Strikeout data was provided by Nevada head coach Matt Meuchel, who also looked deeper into strikeouts over the past 4 seasons in D-I softball. Matt found that strikeouts per plate appearance account for 24.84% of the variance in runs per game for all of the teams in D-I softball. This likely means that while strikeouts have an effect on scoring, the relationship is not as strong as the one between, say, home runs and scoring.

Strikeout PA Scoring D-I

Matt also took a look at the relationship between strikeouts looking and runs per game over the past 4 seasons.

Strikeout Looking PA Scoring D-I

As shown above strikeouts looking accounted for 11.02% of the variance in runs per game. The weakness of the relationship can also be seen by how far each team’s point strays from the trendline.
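
For readers curious where a figure like 24.84% comes from: with a single predictor and a straight trendline, the share of variance explained (R²) is the square of the correlation between the two columns. So an R² of .2484 corresponds to a correlation of roughly .50 in magnitude, and .1102 to roughly .33. The Python sketch below shows the calculation; the `r_squared` function name and the sample numbers are hypothetical and are not Matt’s data.

```python
def r_squared(xs, ys):
    """Share of the variance in ys explained by a straight-line fit on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squares and cross-products around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    # Squared Pearson correlation.
    return (sxy * sxy) / (sxx * syy)


# Hypothetical strikeouts per plate appearance and runs per game for 6 teams.
strikeouts_per_pa = [0.12, 0.14, 0.15, 0.17, 0.19, 0.21]
runs_per_game = [5.1, 4.2, 4.9, 4.4, 3.6, 4.3]

print(f"R-squared: {r_squared(strikeouts_per_pa, runs_per_game):.4f}")
```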

For me I would say Babe Ruth put it best when he said, “Never allow the fear of striking out keep you from playing the game”. In women’s softball the strikeout is becoming less of a fear all the time.

Softball’s Golden Age of Defense

It’s easy to think that back in the day, when we were younger, the game was somehow better. It doesn’t seem to matter what that game is but in our memory that game was somehow at its best in the past.

For women’s softball, many players and coaches fondly remember the small-ball era. From 1982-1992 and again from 2001-2004, D-I women’s softball was a low run-scoring environment (see the chart on scoring in D-I softball). Since 2004 scoring has jumped over 35%, making some long for the old days of pitching and defense.

But were those times really so golden? From a fielding-percentage perspective, it wouldn’t seem that they were.

Fielding D-I

Though fielding percentage may not be the ideal metric for measuring defense, it’s what we have available. And despite what we might think according to our memories, the numbers show that we now could be in the golden era of defense.

A thanks again to Nevada’s Matt Meuchel for making all of this data available.

Scoring in D-I Softball from 1982 to the Present

There are times when I wonder why I have this site, if it’s worth the trouble, and if anyone really reads it. Then there are other times, like today, when I receive a piece of research unexpectedly and it feels like Christmas. I guess that means Nevada’s head coach Matt Meuchel is Santa Claus.

Matt, who is known in softball circles as a numbers person, just sent me a package of softball research. I will try to release some of these stats on a daily basis. A big thank you to Matt for sharing!

Recently I wrote about some offensive trends in softball. Matt’s numbers go much farther than mine and thus provide a richer picture of trends in D-I softball. Here are runs per game, per team since 1982.

D-I Scoring

Scoring in D-I softball hit its low point in 1986 at just 3.02 runs per game, per team. Scoring peaked last year at 4.81, a remarkable increase of about 59 percent.

Matt looked at the relationship between batting average and scoring and found a strong correlation. Batting average accounts for 84.46% of the variance in scoring over the past 35 seasons in D-I softball. The strength of this relationship is also shown by how closely each point in the chart below sits to the trendline.

Batting Avg Scoring D-I

According to my rough estimate every 10 points of batting average accounts for a change in scoring of about .25 runs per game. If I take a leap and infer that a team increases its batting average by 10 points, over the course of a 50 game season that team can expect to score 12.5 more runs. Plugging that number back into the formula for the Pythagorean Theorem, such an increase would mean 1 more win for that team.
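
The arithmetic above can be checked with a short Python sketch. It assumes an otherwise average team that scores and allows 4.81 runs per game (the D-I scoring figure cited earlier), a 50 game season, and an exponent of 1.5, which is the D-I value reported in the Pythagorean article further down this page. The function name `pythagorean_pct` is made up for this illustration.

```python
def pythagorean_pct(runs_scored, runs_allowed, exponent):
    """Estimated winning percentage from runs scored and runs allowed."""
    rs = runs_scored ** exponent
    ra = runs_allowed ** exponent
    return rs / (rs + ra)


GAMES = 50
RUNS_PER_GAME = 4.81      # assumption: team scores and allows the D-I average
EXPONENT = 1.5            # assumption: D-I exponent from the Pythagorean article
RUNS_PER_10_POINTS = 0.25 # rough estimate from the paragraph above

base_runs = RUNS_PER_GAME * GAMES          # 240.5 runs scored and allowed
extra_runs = RUNS_PER_10_POINTS * GAMES    # 12.5 additional runs scored

pct_before = pythagorean_pct(base_runs, base_runs, EXPONENT)
pct_after = pythagorean_pct(base_runs + extra_runs, base_runs, EXPONENT)
extra_wins = (pct_after - pct_before) * GAMES

print(f"Extra runs over {GAMES} games: {extra_runs}")
print(f"Winning % before: {pct_before:.3f}, after: {pct_after:.3f}")
print(f"Extra wins: {extra_wins:.2f}")
```

Under those assumptions the gain works out to a little under one win, which is in line with the rough estimate.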

One element of batting average is home runs. Part of the problem with batting average is that it treats all hits (singles, doubles, triples, and home runs) equally. In reality we know that all hits aren’t created equal since a home run is much more valuable than a single. Matt also looked at whether scoring correlates with home runs.

HR Scoring D-I

With an R² (coefficient of determination) of .7292, scoring correlates quite well with home runs alone. Using the earlier example of a 50 game season, if your team were to hit around six more home runs each season I would expect you to win one more game.

Matt also found that there is little relationship between stolen bases per game and the number of runs that are scored.

SB Scoring D-I

What is interesting to me is that even in years where the run-scoring environment was low, the number of bases that were stolen doesn’t appear to correlate to the number of runs scored. This seems to reinforce my previous research on stolen bases which showed that it’s not the number of bases that a team steals but how efficiently they do it that matters.

Thank you again Matt! Much more to come.

Softball in 2016 and the Music of the Spheres

Pythagoras created the concept of the music of the spheres which suggested that the universe, like music, is governed by mathematical equations. Using a modified version of Bill James’ Pythagorean Theorem it can be shown that the softball universe is also governed by an equation.

James created his version of the Pythagorean Theorem in order to demonstrate the relationship in baseball between the number of runs a team scores over the course of a season, the number of runs it allows, and its winning percentage. James’ equation looked like:

Pythagorean Formula: Winning % = (Runs Scored)² / [(Runs Scored)² + (Runs Allowed)²]

James eventually found that by adjusting the exponent, the equation did an even better job of predicting winning percentage. For baseball the exponent is somewhere around 1.82, though it can vary slightly on a yearly basis.
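
In code the formula with an adjustable exponent is only a few lines. Here is a minimal Python sketch; the function name `pythagorean_pct` and the 800 and 700 run totals are made up for illustration.

```python
def pythagorean_pct(runs_scored, runs_allowed, exponent=2.0):
    """Estimated winning percentage: RS^x / (RS^x + RA^x)."""
    rs = runs_scored ** exponent
    ra = runs_allowed ** exponent
    return rs / (rs + ra)


# Hypothetical baseball team: 800 runs scored, 700 runs allowed.
original = pythagorean_pct(800, 700)                 # James' original exponent of 2
adjusted = pythagorean_pct(800, 700, exponent=1.82)  # adjusted exponent cited above

print(f"Exponent 2.00: {original:.3f}")
print(f"Exponent 1.82: {adjusted:.3f}")
```

A smaller exponent pulls every estimate toward .500, so the same run differential is worth fewer wins when the fitted exponent is lower.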

Here is a women’s softball version of the Pythagorean Theorem applied to each NCAA division from the 2016 season.

Division III

The following scatter plot compares the record of each team in D-III in 2016 against how well the formula estimated each team’s record.

D-III

Each dot in the above chart represents one of the 410 teams in D-III. Because the points are so close to the trendline, the formula fits quite well. This is also demonstrated by the R² of .9229, which means that the formula accounts for 92.29 percent of the variance in the teams’ actual winning percentages, which is quite high. The farther a point is from the trendline, the more it could suggest either good luck or bad luck for that team in the 2016 season. And with an exponent this low (1.52 for D-III in 2016), luck certainly plays a large role in the game.

Here are the five unluckiest teams in D-III in 2016 according to the formula.

Institution          Runs Scored  Runs Allowed  Actual Winning %  Predicted Winning %
Lesley               282          136           .564              .752
St. Scholastica      307          115           .634              .816
Kean                 190          132           .476              .635
Staten Island        254          99            .650              .807
Minn.-Morris         220          252           .300              .449

Lesley finished the season 22-17. According to their run differential they should have been 29-10. Their bad luck can be seen in their losses, which included 11 losses by three runs or fewer. Shown below is where Lesley fell on the scatter plot.

D-III Lesley
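
Lesley’s numbers can be reproduced from the table above. This is a minimal Python sketch, assuming the D-III exponent of 1.52 and a 39 game season (22 wins plus 17 losses); the function and variable names are made up for illustration.

```python
def pythagorean_pct(runs_scored, runs_allowed, exponent):
    """Estimated winning percentage: RS^x / (RS^x + RA^x)."""
    rs = runs_scored ** exponent
    ra = runs_allowed ** exponent
    return rs / (rs + ra)


D3_EXPONENT = 1.52          # D-III exponent for 2016
wins, losses = 22, 17       # Lesley's actual record
games = wins + losses

predicted_pct = pythagorean_pct(282, 136, D3_EXPONENT)
expected_wins = round(predicted_pct * games)
luck = wins - expected_wins  # negative means the team fell short of the formula

print(f"Predicted winning %: {predicted_pct:.3f}")
print(f"Expected record: {expected_wins}-{games - expected_wins}")
print(f"Wins above or below expectation: {luck}")
```

The same few lines work for any team in these tables by swapping in its runs scored, runs allowed, record, and the exponent for its division.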

Here were the five teams that benefited the most from good luck in D-III in 2016, meaning that they outperformed their predicted winning percentage.

Institution          Runs Scored  Runs Allowed  Actual Winning %  Predicted Winning %
Widener              235          170           .738              .621
North Central (IL)   150          180           .526              .431
Penn St. Harrisburg  118          191           .419              .325
Elmhurst             157          217           .474              .379
Emmanuel (MA)        177          172           .605              .511

The sun shone brightest on Widener (31-11) in 2016, which should have gone 26-16 this past season according to the formula. Widener is shown on the scatter plot below.

D-III Widener

A look at their schedule confirms the Pride’s good fortune in 2016, since they won 16 games by two runs or fewer.

For the team where I’m an assistant coach, Claremont-Mudd-Scripps Colleges, the formula suggests we should have gone 38-9-1. In reality we were 37-10-1. That’s not much of a difference but I can certainly think of one more win I’d like to have.

Overall the formula estimated the final record of 82 teams within 1/2 a win.

Division II

For Division II softball the formula, with its exponent of 1.55, is even more accurate as it accounted for 94.79 percent of the variance in the record of the 295 teams.

D-II

The team with the worst luck in D-II was Spring Hill College. Not that the Badgers didn’t have a great season at 37-14, but according to their run differential they should have been 43-8. This is also borne out in their schedule, which shows seven losses by two runs or fewer.

Institution          Runs Scored  Runs Allowed  Actual Winning %  Predicted Winning %
Spring Hill          339          115           .725              .842
Shaw                 210          308           .244              .356
Cal St. East Bay     183          182           .391              .502
Coker                182          215           .326              .436
Wis.-Parkside        147          277           .163              .272

In contrast, Livingstone (5-20) had the most luck of any team in D-II by percentage since their run differential should have resulted in a record of 3-22.

Institution          Runs Scored  Runs Allowed  Actual Winning %  Predicted Winning %
Livingstone          85           327           .200              .110
Grand Valley St.     250          153           .768              .682
Wheeling Jesuit      226          192           .636              .562
Quincy               126          216           .371              .302
Florida Tech         269          178           .722              .655

Overall the formula estimated the final record of 56 teams within 1/2 a win. One of those teams was that of our friends at Adelphi University, who finished with a .614 winning percentage. The formula estimated the Panthers would have a .617 winning percentage. Not bad!

Division I

Using an exponent of 1.5, the formula in D-I softball accounted for 92.53 percent of the variance in the record of the 295 teams.

D-I

Utah Valley topped the list of most unlucky teams in D-I. The Wolverines (9-43) should have won six more games than they did according to the formula. Looking at their schedule, their bad luck can be confirmed by 15 losses by two runs or fewer. Here are the five most unlucky teams by percentage in D-I.

Institution          Runs Scored  Runs Allowed  Actual Winning %  Predicted Winning %
Utah Valley          200          362           .173              .291
Iowa                 198          291           .250              .359
Prairie View         168          239           .263              .371
Jackson St.          168          271           .222              .328
UConn                188          204           .365              .469

On the other side of the coin, Fresno State (42-12-1) topped the list of teams outperforming their predicted winning percentage. The Bulldogs should have won four fewer games in 2016 according to their run differential.

Institution          Runs Scored  Runs Allowed  Actual Winning %  Predicted Winning %
Fresno St.           347          219           .780              .666
Nevada               244          200           .681              .574
A&M-Corpus Chris     107          204           .375              .275
Central Ark.         232          222           .614              .517
Jacksonville         228          307           .483              .390

All in all the formula estimated the final record of 56 teams within 1/2 a win.