An Example of Cleaning Untidy Data with Tidyr

 

Quote of the Day

I wish to do something great and wonderful, but I must start by doing the little things like they were great and wonderful.

— Albert Einstein


Introduction

Figure 1: P-61 Black Widow, the Most Expensive Army Air Force Fighter of WW2. (Source)

I recently decided to take some classes in data analysis at Datacamp, an online training site. My first classes were in dplyr and tidyr – two excellent R-based tools for working with files that are not amenable to analysis because of inconsistencies and structural problems: tidyr provides many tools for cleaning up messy data, and dplyr provides many tools for restructuring it. After completing the two classes, I decided that I needed to firm up my knowledge of these tools by applying them to a concrete example.

The gnarliest data that I know of is hosted on the Hyperwar web site. The data seems to be scanned and OCRed WW2 records that were translated into HTML. While a fantastic resource for history aficionados, these records are painful to analyze using software because of:

  • Numerous OCR errors
    The original source is often difficult to figure out.
  • Inconsistent expression of zeros
    I see blanks, zeros, periods, dashes.
  • Mixing data and totals in seemingly random ways
  • Arithmetic errors (e.g. incorrect totals)
    I often think of some poor secretary/yeoman typing numbers all day and then having to use an adding machine to complete a total. It would have been miserable work.
  • Typing errors
    For example, a cargo ship listed as 70,006 tons during WW2 was most likely 7,006 tons – at that time, cargo ships were typically in the 10,000-ton range. This error actually occurred. I was able to confirm it because Wikipedia has a remarkable amount of data on WW2 ships, and it listed 7,006 tons for this ship.
  • Unbelievably complex headings (literally three and four levels)

This is not a complaint about the Hyperwar site itself. This is just a statement about the format and quality of WW2 records that were written by humans typing numbers at keyboards. For this post, I decided to download data on the cost of WW2 Army Air Force combat aircraft and clean it up for graphing. All of the heavy lifting is done in RStudio – only the Appendix is done in Excel because I like how it formats tables.

Background

Problem Statement

I have read quite a bit about air operations during WW2, and I have often read that the P-47 and P-38 were very expensive fighters compared to the P-51. One reason given for the P-51 replacing the P-47 and P-38 in many applications was its lower unit cost. I decided that I would research the cost of these aircraft as a motivating application for practicing with tidyr and dplyr.

Definitions

tidy data
Tidy data is a term coined by Hadley Wickham (creator of tidyr and dplyr) to describe data formatted with variables in columns and observations in rows, with no blank rows or columns. Missing data is clearly marked in a consistent manner, as are zeros. If you are a database person, you will see hints of his concepts in Codd's rules for relational databases.
untidy data
Untidy data is data that is not tidy – I hate definitions like this, but it works here. Common problems include:

  • Values of variables used for column headings (e.g. column headings containing years – years should be a column with specific year values stored for each row).
  • Data that should be in multiple tables combined into one table. (e.g. describing a person's pets means multiple table rows for people with more than one pet – the pet details should be in a separate table.)
  • Aggregate data (e.g. totals) in the rows.
  • Blank rows or columns.
  • Single observation scattered across multiple rows.
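To make the first problem concrete – values of a variable used as column headings – here is a minimal sketch of the reshaping involved. My actual cleanup uses R's tidyr; this pandas version, with made-up numbers (not the Hyperwar data), shows the same idea:

```python
import pandas as pd

# Hypothetical untidy table: years appear as column headings.
untidy = pd.DataFrame({
    "Aircraft": ["P-51", "P-47"],
    "1943": [50000, 85000],   # illustrative costs only, not the real figures
    "1944": [48000, 83000],
})

# Tidy form: one row per aircraft-year observation,
# with Year as a proper variable column.
tidy = untidy.melt(id_vars="Aircraft", var_name="Year", value_name="UnitCost")
tidy["Year"] = tidy["Year"].astype(int)

print(tidy)
```

The tidy form has four rows (two aircraft × two years), which is what plotting tools such as ggplot2 expect.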

Analysis

All costs are expressed in WW2 dollars. The point of this exercise is to provide a worked Rmarkdown-based example of cleaning an untidy data file. Converting WW2 dollars to modern dollars is fraught with issues, which I leave to others.

Rmarkdown File

The Rmarkdown file is a bit long, so I include a PDF of it here as a link. For the complete source, see this zip file. The zip file contains the raw Hyperwar tab-separated file, a cleaned-up .csv, Rmarkdown source, and a PDF of the processed Rmarkdown document.

Graphs

I am not interested in plotting all the data – there are just a few WW2 aircraft in which I am interested. I plotted the costs of my favorite WW2 fighters and bombers versus time. Note that some of the costs dropped dramatically over time, as they do for many products as production volume grows (example).

Fighter Plane Costs Versus Time

Figure 2 shows how the unit costs change with time for my favorite WW2 fighter planes. Note that the P-38 and P-47 are quite expensive relative to the P-51. The most expensive US fighter in WW2 was the P-61 night fighter (Figure 1).

Figure 2: WW2 Army Air Force Fighter Plane Costs Versus Time.

Bomber Plane Costs Versus Time

Figure 3 shows how the costs change with time for my favorite WW2 bombers. Notice that the B-29 was ~2.6x more expensive than the B-17. However, only the B-29 had the range needed to bomb Japan from the Mariana Islands.

Figure 3: WW2 Army Air Force Bomber Costs Versus Time.

Cleaned-Up CSV File

The Rmarkdown file outputs a cleaned-up version of the data as a .csv file. I include that file with my source. You can see a tabular version of the data in Appendix A.

Conclusion

I work in corporate America, where Excel is the standard for processing data. While Excel has some strengths, it is not the most powerful data analysis tool available. RStudio provides a wonderful data analysis and presentation environment. This post provides a fully worked example of how to use RStudio with tidyr and dplyr to tidy-up some very messy data.

Appendix A: Hyperwar Aircraft Cost Table Reconstruction.

Figure 4 shows how my tidy data version of the cost data can easily be converted back into the original form.

Figure 4: Import of Cleaned Table Into Excel for Table Presentation.


Posted in Excel, History Through Spreadsheets | Leave a comment

Quick Look at Large US Dams

 

Quote of the Day

Amateurs practice until they get it right. Professionals practice until they can’t get it wrong.

— Special Operations Credo. I hear doctors say something similar. It is true in Engineering and Software as well. For a professional, it is about far more than getting things right – that is a given. It is about having personal processes that reduce the possibility of mistakes and ensure that you can handle any contingency that might arise.


Introduction

Figure 1: Oroville Dam. (Wikipedia)

I was on the phone this morning with a coworker who lives in California, about 150 miles south of the Oroville dam (Figure 1). This dam has recently been in the news because of concerns that spillway erosion could cause a dam failure. At one point, nearly 200K people were evacuated from the potential flood zone.

My coworker was quite familiar with the situation in Oroville, and he mentioned that warnings had been given in years past that this situation could occur. People are now asking how something like this could happen. I will leave that for the politicians to explain.

I was surprised to hear that the Oroville dam is the tallest dam in the US – I guess I remember hearing as a child that Hoover Dam was the tallest. This got me curious about which dams are the tallest in the US and where they are located. Wikipedia has an excellent list of the 86 tallest dams in the US. I used Power Query to grab the list and pivot tables to examine the data. For those interested, my source is here.

Figure 2 shows the top 10 tallest dams in the US and their locations. Height is expressed in feet.

Figure 2: Ten Tallest Dams in the US. Height in ft.

I also was curious about when most of these 86 dams were built, which I show in Figure 3. It looks like the 1960s was a big decade for dam building.

Figure 3: Decades When Tallest Dams in US were Built.

I also looked at where the dams were built (Figure 4). By region, the Pacific Contiguous (coast) and Mountain West regions had the largest number of dams by far. I used the US regions as defined by the Department of Energy.

Figure 4: Dam Locations By US Region.

Figure 5 shows that California has the largest number of these tall dams.

Figure 5: Dams By State.


Posted in Excel, History of Science and Technology | 3 Comments

Daily Tree Consumption for Toilet Paper

 

Quote of the Day

When the satisfaction or the security of another person becomes as significant to one as one's own satisfaction or security, then the state of love exists. Under no other circumstances is a state of love present, regardless of the popular usage of the term.

— Harry Stack Sullivan

Introduction

Figure 1: Typical Roll of Toilet Paper. (Source)

I was reading an article on the National Geographic website when I spotted an interesting factoid about the impact of Toilet Paper (TP) on world-wide tree consumption.

Toilet paper wipes out 27,000 trees a day.

Like many factoids, I doubt there is a way to actually measure this number – it can only be estimated. Thus, it is a prime candidate for a Fermi solution.

I see factoids like this all the time. My favorite factoid in the fiber-optic business is that 99% of transoceanic Internet traffic is carried by submarine cables. If you ask around, no one can tell you how the remaining 1% is carried – 1% of total transoceanic bandwidth is a lot of bandwidth for non-fiber transports (e.g. Iridium, TDRS). I heard a submarine cable expert say that 99.999% is probably closer to the true value, but people hedge their numbers by saying 99%. A better answer is that virtually 100% of transoceanic Internet traffic is carried by submarine cables, and the amount not carried by them is so small that no one knows what it is. An example of a place that requires transoceanic data service but has no transoceanic fiber access is Antarctica.

Background

General References

The following links provided me some good background for the analysis that follows.

  • National Geographic article mentioning the factoid (Link).
  • Blog post on toilet paper rolls per tree (Link).
  • Typical toilet paper measurements (Link).
  • Tree pulp statistics (Link).
  • Wikipedia on tree pulp (Link).
  • General toilet paper info (Link).

Some Tree Statistics

The journal Nature reports that:

  • The world is home to more than 3 trillion trees.
  • People cut down 15 billion trees per year.
  • The number of trees has declined by 46% since the beginning of human civilization.

Average Mass of Harvested Tree

The mass of the average tree harvested for pulp can be estimated using Equation 1, an empirical formula developed by the US Forest Service. This formula gives the typical mass of a tree based on its diameter. The parameters are species-dependent. For this exercise, I assumed the trees are aspens, which are commonly used for pulp where I live. The specific parameters (β0, β1) are given in Appendix A.

Eq. 1 \displaystyle m_{\text{Tree}}(d) = e^{\beta_0 + \beta_1 \cdot \ln(d)}

where

  • mTree is the mass of the tree [kg].
  • d is the diameter of the tree measured at breast height [cm]. This parameter is often referred to as "d.b.h."

The US Forest Service has a number of other mathematical models for tree mass versus diameter. I chose this one because it was easy to code.
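Equation 1 is simple to code. Here is a sketch in Python; note that the β0 and β1 defaults below are placeholders of mine, not the aspen parameters from Appendix A:

```python
import math

def tree_mass(d_cm, beta0=-2.0, beta1=2.4):
    """Eq. 1: estimated tree mass [kg] from diameter at breast height [cm].

    beta0 and beta1 are species-dependent; the defaults here are
    placeholders, not the Appendix A aspen values.
    """
    return math.exp(beta0 + beta1 * math.log(d_cm))

# Mass of a hypothetical 25 cm d.b.h. tree under the placeholder parameters:
print(round(tree_mass(25.0), 1))
```

Because the model is a power law (mass = e^β0 · d^β1), small changes in the assumed diameter move the mass estimate substantially.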

Analysis

Figure 2 shows my analysis. I included many comments in-line, so I will not go through the details in my introductory text. For those who want to view my source, I include it here.

Figure 2: Estimates of Trees Consumed By Toilet Paper Usage.

Conclusion

I can see where the 27K number is plausible. The actual number of trees cut for use in toilet paper is probably unknowable and can only be estimated. Unfortunately, the analysis is sensitive to parameters that are highly variable:

  • percentage of people that use TP.
  • amount of TP used per person.
  • diameter of trees harvested for TP.
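To see this sensitivity concretely, here is a bare-bones version of the Fermi estimate in Python. Every input below is a rough, assumed value of mine rather than a figure from the worked analysis in Figure 2:

```python
# All inputs are rough, assumed values for illustration only.
tp_users       = 1.0e9   # people using TP regularly
rolls_per_year = 20      # rolls per person per year
roll_mass_kg   = 0.2     # paper mass per roll [kg]
paper_yield    = 0.5     # kg of paper produced per kg of harvested tree
tree_mass_kg   = 500     # mass of a typical harvested tree [kg]

paper_per_day = tp_users * rolls_per_year * roll_mass_kg / 365
trees_per_day = paper_per_day / (paper_yield * tree_mass_kg)
print(f"{trees_per_day:,.0f} trees per day")
```

Under these assumptions the result lands in the tens of thousands of trees per day – the same order of magnitude as the 27,000 factoid – but doubling or halving any one input moves the answer by the same factor.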

I suspect that the 27K trees per day number is probably low. Even with the uncertainty involved, it is an interesting number because it shows that the environmental impact of a small item can be substantial if enough people use it.

Appendix A: Formula for Tree Mass vs Diameter.

Figure 3 shows the formula that I used to estimate the mass of a tree based on its diameter.

Figure 3: US Forest Service Formula for Tree Mass vs Diameter. (Source)


Posted in General Mathematics, General Science | 4 Comments

Fact Checking: US Murder Rate Over Time

 

Quote of the Day

I was planning to vote anarchist, but they don't seem to have any candidates.

— Michael Rivero


Figure 1: US Murder Rate Versus Time. The units are murders per 100K population. (Data Source)

I recently heard a politician claim that the US murder rate is the "highest it's been in 47 years." This is an easy fact to check, and it gave me another Power Query example for my staff.

This statement is not correct. The statistics for 2016 have not been released yet, but I can easily look at the data from 1960 to 2015. The actual 2015 rate is the ninth lowest in the period from 1960 through 2015. In fact, 2013 and 2014 had the lowest murder rates during that interval. You have to go back to the 1950s to find significantly lower murder rates than we are experiencing now – this data is included in the spreadsheet attached below.

Here is the annual murder rate rank computed on my spreadsheet.

Figure 2: Years of Lowest Murder Rates.

For those who want to follow my work, here is my source.

Posted in Excel, Fact Checking | 2 Comments

Fact Checking: Comparison of Military Budgets

 

Quote of the Day

The best way to predict the future is to create it.

— Abraham Lincoln


Introduction

Figure 1: Largest 15 Defense Budgets (2015). The yellow rows indicate the point at which the running sum of non-US defense budgets exceeds the US total. (Data Source)

I heard a news commentator say that the US defense budget is larger than the combined defense budgets of the next ten largest spenders. I thought that this would be easy to check and would provide my staff a beginner's example to use for their Excel self-training.

I was able to find a list of the top fifteen countries by 2015 defense spending on Wikipedia. I used Power Query to grab the data, clean it up, and generate a running total – Figure 1 is my final result. The particular skill I wanted to illustrate with this exercise was how to generate a running total in Power Query.

Yes, the US defense budget is larger than the next ten largest defense budgets combined. My source file is here for those who wish to follow my work.
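Figure 1's running total is computed in Power Query, but the logic is easy to sketch in any language. Here it is in Python, with made-up spending figures rather than the Wikipedia 2015 data:

```python
from itertools import accumulate

# Hypothetical budgets in $B; non-US budgets sorted in descending order.
us_budget = 600
others = [250, 200, 150, 100, 90]

running = list(accumulate(others))  # running total of the non-US spenders
# First rank at which the cumulative non-US total exceeds the US budget:
rank = next(i + 1 for i, total in enumerate(running) if total > us_budget)
print(rank)
```

With the real 2015 data, the equivalent check confirms the commentator's claim: the cumulative total does not pass the US budget until more than ten countries are summed.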

Posted in Civics Through Spreadsheets, Fact Checking | 2 Comments

History Through Spreadsheets: Executive Orders

 

Quote of the Day

The tragedy of war is that it uses man's best to do man's worst.

— Harry Emerson Fosdick. Some of the finest engineering I ever saw occurred when I was working on defense contracts. I often wished that kind of brain power could have been focused on space exploration, curing diseases, and educating others.


Figure 1: Average Number of Yearly Executive Orders By President. (Data Source)

During my recent seminar on Excel's Power Query feature, I showed my team how to grab executive order data from the web and generate a simple plot (Figure 1). After generating the plot, I asked the audience what we could learn from this graph. I was expecting to hear that the early 1900s – the time between Teddy Roosevelt and Franklin Roosevelt – was a time of massive use of executive orders.

The audience response was much different than I anticipated:

  • They did not realize that executive orders have existed since the beginning of the US – Washington issued executive orders.
  • Many believed that Obama had issued an unusual number of executive orders. In fact, he is somewhere in the middle of the pack, slightly below George W. Bush. You can double-check this on Snopes.
  • The enormous number of executive orders issued by Franklin Roosevelt was a surprise to the group. Many people forget that FDR had to deal with the massive challenges of the Great Depression and WW2. His detractors often accused him of governing through executive order. His most infamous executive order interned Japanese-Americans during WW2. Many feel that the 22nd Amendment, which limits presidents to two terms, was passed in response to Roosevelt's use of executive power during his four terms.

Here is my source for those of you who like to follow along.

Posted in Civics Through Spreadsheets, Excel, History Through Spreadsheets | 3 Comments

History Through Spreadsheets: Supreme Court Confirmations

 

Quote of the Day

You can no more win a war than you can win an earthquake.

— Jeannette Rankin. This quote gets me thinking about recent history – every US president ought to think hard about this quote. There is a related quote from Lawrence Korb that I like which says "Wars don't solve problems, they determine who will solve the problems."


Introduction

Figure 1: Supreme Court Confirmations By President. (Data Source)

I recently gave a seminar to my staff on using Excel with Power Query. As part of the seminar, I presented a number of web scraping examples that were well received, and I decided that some of you may appreciate them also.

When I was training myself on web scraping, I needed some free and interesting examples on which to practice. The most sophisticated work I have done involved scraping geographic data from the Wikipedia (example). For the seminar, I needed some examples that were simpler. Since I am interested in history and politics, I decided to focus on two broad categories:

  • History: WW2 is a particularly rich area, but there are many others.
  • Civics: Primarily fact-checking politicians – the amount of political lying lately is breathtaking.

Last night, I watched President Trump announce his nomination for the Supreme Court seat that was held by Antonin Scalia. While watching the announcement, I decided to grab some data on the number of Supreme Court confirmations by president and plot the data (Figure 1). There are some interesting facts to glean from this graph:

  • George Washington got so many appointments because he was first and started with an empty court – no one will ever be first again. The Supreme Court initially had only six members, but there was some churn during his administration. The number of justices did not settle at nine until 1869. At one point, it was as large as ten.
  • Franklin Roosevelt got so many appointments because he was president the longest. Even with all his appointments, he was not happy with the Supreme Court justices at the time – historians still talk about his attempts to "pack" the court by increasing its size.
  • Four presidents had no Supreme Court appointments – Jimmy Carter was the last president to have no appointments.
  • I suspect we will see fewer appointments per presidential term because justices are being nominated when they are relatively young and serve until they are quite old. Presidents like to appoint young justices because it extends their legacy.

For those who are interested, my source file is here.

Posted in History Through Spreadsheets | Leave a comment

US Manufacturing Employment Versus Time

 

Quote of the Day

Character is a diamond that scratches every other stone.

— Cyrus Bartol


Figure 1: Total US Manufacturing Employment Since 1939. (Data Source: Bureau of Labor Statistics)

The only television news program that I watch is the PBS Newshour. I particularly like the discussions between Mark Shields, a reasonable liberal, and David Brooks, a reasonable conservative. On inauguration day (20-Jan-2017), they had an interesting discussion about the challenges the US faces and what can be done about them.

During the discussion, Mark Shields made a statement on US manufacturing employment that was both dramatic and easy to fact check. He stated that

In 1940 there were 137 million people in the United States of America – and ah, 132 million – and there were 600,000 more factory jobs than there are today. There were 8 million more factory jobs in this country when Jimmy Carter was president.

I decided to surf over to the US Bureau of Labor Statistics (BLS) and see what the official statistics say. The BLS tracks manufacturing employment – I could not find "factory employment." Assuming that Shields was talking about manufacturing employment, I plotted the BLS data in Figure 1. Here is what I gleaned from these numbers:

  • There are actually 1.4 M more manufacturing jobs at the end of 2016 than in 1940, so Shields is wrong on this point.
  • Shields's overall point is that manufacturing employment has not tracked with population – he is absolutely correct about that. The US has roughly 320 million people today, nearly 2.5x the US population in 1940. If manufacturing employment had tracked with population, we would have 26.3 M people employed in manufacturing today instead of the 12.3 M we actually have.
  • He is right that manufacturing employment peaked during the Carter years, when the number of manufacturing jobs was nearly 8 M more than today.
  • It is unclear to me how many manufacturing jobs were lost because companies decided to move overseas versus structural changes in our economy that reduced the need for manufacturing workers. For example, Fortune magazine is reporting that 88% of the US manufacturing jobs lost were due to automation and local factors, not international competition.
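The population-scaling arithmetic in the second bullet can be reproduced directly from the numbers quoted above:

```python
pop_1940 = 132e6               # US population in 1940 (Shields's figure)
pop_today = 320e6              # rough US population today
mfg_today = 12.3e6             # manufacturing jobs at the end of 2016
mfg_1940 = mfg_today - 1.4e6   # 1.4 M fewer jobs in 1940 than today

# Jobs today if manufacturing employment had grown with population:
scaled = mfg_1940 * pop_today / pop_1940
print(f"{scaled / 1e6:.1f} M")  # close to the 26.3 M figure quoted above
```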
Figure 2: Bishman Tire Changer. (Source)

I am from a small town (Osseo, MN) where employment was divided between manufacturing and agriculture. The main manufacturing employer was a company called Bishman Manufacturing, which made an automobile tire changer (Figure 2). I remember when they closed shop in Osseo (1970s) and moved to South Dakota to find lower wage workers. A loss of jobs like that is devastating for a small town. Most of the workers refused to move, with many of the older workers taking low-pay jobs to try to bridge them to retirement.

If you wish to see the actual statement by Mark Shields, I have included a video link in Figure 3. For those who want to check my work, here is my source file.

Figure 3: Mark Shields's Statement Starts at 14:45 into the Video.

 

Posted in History Through Spreadsheets, Statistics | 4 Comments

Slide Rules of the Rocket Pioneers

 

Quote of the Day

Too often man handles life as he does the bad weather. He whiles away the time as he waits for it to stop.

— Alfred Polgar


Figure 1: Typical student slide rule from the 1960s. Mine was very similar. The minute calculators became available in the 1970s, I bought one – for about $150. That was a bunch of money for a high-school kid back then. (Source)

I was watching an interview with Valerie Neal, a curator and chair at the National Air and Space Museum, on CSPAN. The interview focused on the history of rocket development in both the US and the Soviet Union. Valerie was asked what her favorite artifact at the National Air and Space Museum was. She responded that she likes artifacts that were the personal items of the pioneers. In the case of space travel, she said that the slide rules of rocket pioneers Wernher von Braun and Sergei Korolev were her favorite artifacts. Both men used the same type of slide rule. As I looked closely at the slide rules (Figures 2 and 3), I realized they were the same brand – Nestler – as used by some engineers I knew as a boy. My slide rule was a Pickett, similar to the one shown in Figure 1.

Figure 2: Slide Rule of von Braun. Figure 3: Slide Rule of Korolev.

Unfortunately, the photos are not of high enough resolution to read the specific model numbers of the slide rules in Figures 2 and 3. I have read in this document that von Braun preferred the 9-scale Nestler 23R slide rule – which was also used by Einstein. Figure 4 shows a more detailed photo of this slide rule.

Figure 4: Detailed Photo of Nestler 23 Slide Rule. (Source)

For those who want more background on slide rules, here is an excellent document.

Posted in History of Science and Technology | 2 Comments

Futurama Quote on Largest Buggalo Ranch on Mars

 

Quote of the Day

A new idea comes suddenly and in a rather intuitive way. But intuition is nothing but the outcome of earlier intellectual experience.

— Albert Einstein


Figure 1: Buggalo Ranching on Mars' Western Hemisphere. (Source)

I am a big fan of Futurama – its going off the air was as disappointing for me as discovering Firefly a few years ago and having only 14 episodes to watch. I love the fact that Futurama often includes small bits of real math and science in its scripts. My all-time favorite piece of math is a blackboard showing an actual group-theory proof by Sweet Clyde in the episode "The Prisoner of Benda."

Today, I was watching the episode "Where the Buggalo Roam," which mentioned that the area of the Wong ranch encompassed the western hemisphere of Mars.

Here is the exact quote:

Farnsworth: This is quite a large ranch you have.
Mr. Wong: 17.9 billion acres. We own entire western hemisphere. [whispering] That the best hemisphere!
Farnsworth: It's the same on Earth.

I did the following calculation that confirmed that Futurama got the Mars hemispherical area calculation correct – these folks take their science and math seriously.

Figure 2: Calculations Showing Futurama Computed the Hemispherical Area of Mars Correctly.

Figure 2: Calculations Showing that Futurama Computed the Hemispherical Area of Mars Correctly.
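The check in Figure 2 takes only a few lines in any language. Here is a Python sketch, using Mars's mean radius of roughly 3,390 km:

```python
import math

R_MARS_M = 3389.5e3        # mean radius of Mars [m]
M2_PER_ACRE = 4046.86      # square meters per acre

# Curved surface area of one hemisphere: half of 4*pi*r^2.
hemisphere_m2 = 2 * math.pi * R_MARS_M ** 2
hemisphere_acres = hemisphere_m2 / M2_PER_ACRE
print(f"{hemisphere_acres / 1e9:.1f} billion acres")  # ≈ 17.8–17.9 billion
```

The result agrees with Mr. Wong's 17.9 billion acres to within about 1%.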

Posted in Humor | 2 Comments