Naked and Afraid – Who Taps Out More? Men or Women


Quote of the Day

The best of ideas is hurt by uncritical acceptance and thrives on critical examination.

— George Polya, mathematician

Figure 1: Locations of 2-Person Naked and Afraid Episodes.


I have been receiving quite a few questions about my database of Naked and Afraid information. I only track the shows with 2-person teams – I am not a big fan of the XL episodes. There recently have been a number of shows where women have been particularly strong leaders and finishers, while some big, tough-looking men struggled right from the start and failed to complete the challenge. In response to these shows, a number of people have asked, "Who taps out more – men or women?" So I updated my Excel workbook with the data from recent episodes and went to work. I have also updated Figure 1, which shows where the episodes were filmed.

I filtered my database for episodes where a woman finished and a man did not, or vice versa. This process generated a list of solo finishers and the number of days they were solo (Figure 2). Seventeen people finished the challenge alone, 9 men and 8 women.
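This filtering step is easy to express in Pandas. The sketch below uses invented episode records and column names, not my actual database:

```python
import pandas as pd

# Hypothetical episode records: one row per participant.
df = pd.DataFrame({
    "episode":  [1, 1, 2, 2, 3, 3],
    "sex":      ["M", "F", "M", "F", "M", "F"],
    "finished": [True, False, True, True, False, True],
})

# Keep only episodes where exactly one of the two partners finished.
mixed = df.groupby("episode").filter(lambda g: g["finished"].sum() == 1)

# The solo finishers are the rows in those episodes where finished is True.
solo = mixed[mixed["finished"]]
print(solo[["episode", "sex"]])
```

The `groupby().filter()` call drops the episodes where both or neither partner finished, leaving only the mixed-outcome episodes of interest.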


Figure 2: List of Solo Finishers and the Number of Days Solo.

Figure 3 shows the summary statistics for the solo finishers. While there have been some strong female solo finishers lately, overall men and women have had similar solo performances, at least with respect to the average days solo and the number of solo finishers.

Figure 3: Summary Statistics of Solo Finishers.

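For those who want to reproduce this kind of summary, a Pandas sketch of the aggregation follows. The solo-finisher day counts here are invented for illustration:

```python
import pandas as pd

# Hypothetical solo-finisher records (day counts are invented).
finishers = pd.DataFrame({
    "sex":       ["M", "M", "F", "F", "F"],
    "days_solo": [17, 10, 21, 14, 7],
})

# Number of solo finishers and average/maximum days solo, split by sex.
summary = finishers.groupby("sex")["days_solo"].agg(["count", "mean", "max"])
print(summary)
```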

I am seeing a pattern in how men tap out that I do want to explore, but I am not certain how to obtain the data to prove it. It seems that the big, muscular men do not do as well as the smaller, wiry men. I do not have data on the size of the men, so I cannot perform this analysis right now – I am thinking about how to gather that data.

Posted in Naked and Afraid | 2 Comments

Royal Navy Losses in WW2


Quote of the Day

Figure out what you are not good at, and assume it will not change, and then hire someone to compensate.

John Sexton, quoted by Anne-Marie Slaughter. This is critical advice for any manager. You personally do not have to be good at everything, but you must be able to attract and retain a group of people that can do everything.


Figure 1: Pivot Table of Royal Navy Losses During WW2. (Data Source)

I have been doing quite a bit of reading lately on WW2 naval actions, and I have been putting together tables that show me ship losses by year. This information gives me a feel for the tempo of battle during the war. I first looked at US naval losses (link) and am now looking at the Royal Navy losses (Figure 1).

A key difference between US and UK naval losses is when the loss peak occurred. Peak losses occurred when (1) there were significant operations in progress, and (2) both sides had significant numbers of combatants. The US rate of loss peaked during 1944, while the Royal Navy's losses peaked during 1942.

In 1942, the Royal Navy was quite focused on Atlantic convoy protection, and they would have had their hands full during the Second Happy Time of the German U-boat fleet. After 1942, British anti-submarine warfare operations became so efficient that serving on a U-boat was almost a death sentence. In 1944, US naval losses would have peaked while the US Navy was focused on island hopping and the Japanese still had a significant navy. By 1945, the Imperial Japanese Navy (IJN) was a hollow force, and there were few IJN ships to engage US forces.

For those who are interested, my spreadsheet is here – along with the data, which I downloaded from a web page. The analysis was performed using Power Query, which is my primary tool for small data sets. To get a quick feel for the data, focus first on the sparklines at the right-hand side of the chart.


Posted in History Through Spreadsheets, Military History | Leave a comment

A Little Beer Data


Quote of the Day

Ladies and gentlemen, this is your captain speaking. We have a small problem. All four engines have stopped. We are doing our damnedest to get them going again. I trust you are not in too much distress.

— Pilot message on British Airways Flight 9 from Singapore to Australia after volcanic ash from an erupting volcano in Indonesia shut down all four of the 747's engines.



Figure 1: Label of the Most Bitter Beer in the Dataset. (Link)

I gave a seminar last week on the use of Python with Pandas in Jupyter notebooks. When I give a seminar, I always have a worked example to illustrate the points that I am trying to make. The attendance at this seminar included a number of young men, so I decided to focus on a topic near and dear to their hearts – beer. I was amazed at the amount of beer knowledge shown by some individuals in the audience. Clearly, beer is a major part of their lives. I myself know NOTHING about beer, so I had to educate myself. The data proved interesting to analyze, and I will probably be augmenting my seminar with additional data over time.

My objectives during this seminar were to show how to use Python with Pandas to:

  • import beer and brewery data files into DataFrames
  • clean up the imported data
  • merge (i.e., join) tables
  • graph the data
  • analyze the data using pivot tables
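The merge and pivot steps can be sketched in a few lines. The tables below are tiny invented stand-ins for the beer and brewery data files, with made-up names:

```python
import pandas as pd

# Invented stand-ins for the beer and brewery files.
beers = pd.DataFrame({
    "name":       ["Hop Bomb", "Mild Ale", "Dark Star"],
    "abv":        [0.099, 0.045, 0.065],
    "brewery_id": [0, 0, 1],
})
breweries = pd.DataFrame({
    "brewery_id": [0, 1],
    "brewery":    ["Alpha Brewing", "Beta Brewing"],
    "city":       ["Portland", "Bend"],
})

# Merge (i.e., join) the two tables on the shared key.
merged = beers.merge(breweries, on="brewery_id")

# Pivot: count of labels per city.
labels_per_city = merged.pivot_table(index="city", values="name", aggfunc="count")
print(labels_per_city)
```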

The presentation was well received, and I felt the information would be worth posting here.

I should mention that I have reduced my blogging rate because I have been busy preparing to build a retirement home in Northern Minnesota. I have never contracted a custom home before, and the effort has been enormous and all-consuming. I will discuss my construction project in a series of posts once I get everything kicked off – no heavy equipment moves in Northern Minnesota until the ground thaws.

For those of you who like to work along, I include my Jupyter notebook, data files, and an Excel version of this analysis here.



Microbrewery
A brewery that produces small amounts of beer, typically much smaller than large-scale corporate breweries, and is independently owned. Such breweries are generally characterized by their emphasis on quality, flavor, and brewing technique. The definition of a small amount of beer varies by state. For example, Missouri limits a microbrewer to no more than 10,000 barrels a year. (Link)
Craft Brewer
A term for the developments succeeding the microbrewing movement of the late 20th century. The definition is not entirely consistent but typically applies to relatively small, independently-owned commercial breweries that employ traditional brewing methods and emphasize flavor and quality. The term is usually reserved for breweries established since the 1970s but may be used for older breweries with a similar focus. For tax purposes, craft breweries produce less than two million barrels of beer a year. (Link)
International Bittering Units (IBU)
A metric used to approximately quantify the bitterness of beer. An IBU is measured in parts-per-million (ppm) of isohumulone, the main chemical compound derived from hops that makes beer taste bitter. Isohumulone is created when the alpha acids in hops isomerize, or break down, in the boil. (Link)
Alcohol By Volume (ABV)
ABV is defined as the number of millilitres of pure ethanol present in 100 millilitres (3.4 US fl oz) of solution at 20 °C (68 °F). The number of millilitres of pure ethanol is the mass of the ethanol divided by its density at 20 °C, which is 0.78924 g/ml. The ABV standard is used worldwide. (Link)
Alcohol Proof
Proof is a measure of the content of ethanol (alcohol) in an alcoholic beverage. The term was originally used in the United Kingdom and was equal to about 1.75 times the alcohol by volume (ABV). The UK now uses the ABV standard instead of alcohol proof. In the United States, alcohol proof is defined as twice the percentage of ABV. (Link)
An alpha acid generated when hops break down during boiling. (Link)
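The ABV and proof definitions above imply two small conversions, sketched here. The density constant comes from the ABV definition; the proof rule is the US convention:

```python
# Density of pure ethanol at 20 °C, from the ABV definition.
ETHANOL_DENSITY_G_PER_ML = 0.78924

def abv_from_ethanol_mass(grams_ethanol: float, solution_ml: float = 100.0) -> float:
    """ABV (%) = millilitres of pure ethanol per 100 ml of solution."""
    ml_ethanol = grams_ethanol / ETHANOL_DENSITY_G_PER_ML
    return 100.0 * ml_ethanol / solution_ml

def us_proof(abv_percent: float) -> float:
    """US alcohol proof is twice the ABV percentage."""
    return 2.0 * abv_percent

print(us_proof(5.0))  # a 5% ABV beer is 10 proof in the US
```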


Data Source

All the data for this analysis comes from this web site. I must applaud the author of this web site for the excellent illustration he provided on the work involved in scraping data from a complex web page. I do much web scraping myself, and his description of how to gather the data is the best I have seen.

Breweries with the Most Labels

As I looked at the data, I noticed that some breweries had an enormous number of beer names in their portfolios. So I generated Figure 2, which is a table of the top ten breweries by number of beer names (also known as labels).


Figure 2: Craft Breweries with the Most Labels.

Beers By Alcohol Content (ABV)

The alcohol content of beer comes up occasionally, and this data set allowed me to look at the distribution of alcohol by volume (i.e. ABV). First, I looked at the top ten beers by ABV (Figure 3). Note that many beers had ABV values of 9.9%, so the beers you see in the table simply sorted higher than the others.


Figure 3: Top 10 Craft Beers By ABV.

The beers in Figure 3 are actually outliers. The bulk of craft beer ABV values are in the 5% range (Figure 4).


Figure 4: ABV Histogram.

My preferred way to view empirical distributions is with a violin plot, which I show for ABV values in Figure 5. The violin plot also allows you to show quartiles, which are shown as dashed lines.
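My figures were not generated with the code below, but a minimal Matplotlib sketch with synthetic ABV values shows the idea, including the dashed quartile lines:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

# Synthetic stand-in for the ABV data (the real values cluster near 5%).
rng = np.random.default_rng(0)
abv = rng.normal(loc=5.5, scale=1.2, size=500)

fig, ax = plt.subplots()
ax.violinplot(abv, showextrema=False)

# Matplotlib's violinplot has no built-in quartile option, so draw the
# quartiles as dashed lines, as in the figure.
quartiles = np.percentile(abv, [25, 50, 75])
for q in quartiles:
    ax.hlines(q, 0.9, 1.1, linestyles="dashed")

ax.set_ylabel("ABV (%)")
ax.set_xticks([])
fig.savefig("abv_violin.png")
```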


Figure 5: ABV Violin Plot.

Cities With The Most Craft Breweries

Figure 6 shows the top ten cities by number of craft breweries. I was not surprised at all to see Portland, OR and Bend, OR on the list.

Figure 6: Cities with Most Craft Breweries.


Top 10 Beers By Bitterness (IBUs)

Figure 7 shows the top ten beers by bitterness. After I presented this table, staff members started talking about how they could obtain a small stock of these beers. I am floored that there are people who seek out bitter beer.

Figure 7: Top Ten Beers By IBU.



The reviews of the seminar were good! My team uses Python as its automation language but has not been using Pandas or Jupyter notebooks – they still use Excel to present their test data. I am hoping that my presentation motivates them to consider Jupyter notebooks and Pandas so that all their work can be done without switching software.

I added Figure 8 because I found it interesting that some smaller brews are becoming commonly available in many states. For example, Summit products are available everywhere in Minnesota.


Figure 8: Beers Most Likely on a Menu By State. (Link)

Posted in Python, software | Leave a comment

US Naval Losses in WW2


Quote of the Day

Every time you find your attention captured by an advertisement, your awareness, and perhaps something more, has, if only for a moment, been appropriated without your consent.

— Tim Wu

Figure 1: US Naval Losses in WW2. (Source)

I have been putting together some information on US naval actions during WW2. Specifically, I wanted to look at US naval losses by year during WW2 in order to get a feel for the change in battle tempo over time. The Wikipedia has an excellent page on all the US naval losses during WW2, so I simply downloaded this page, cleaned it up, and generated an Excel pivot table (Figure 1). The breakdown by combatant type is my own; everything else is from the Wikipedia.
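I built the real table in Excel, but the same pivot is easy to sketch in Pandas. The rows below are invented placeholders in the shape of the Wikipedia loss list, not the actual records:

```python
import pandas as pd

# Invented rows in the shape of the Wikipedia loss list.
losses = pd.DataFrame({
    "ship": ["A", "B", "C", "D", "E"],
    "type": ["Destroyer", "Destroyer", "Carrier", "Submarine", "Destroyer"],
    "year": [1942, 1942, 1942, 1944, 1944],
})

# Count of ships lost, broken down by combatant type and year.
pivot = losses.pivot_table(index="type", columns="year",
                           values="ship", aggfunc="count", fill_value=0)
print(pivot)
```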

To help illustrate the loss rate by year, I added sparklines on the right-hand side of the table. The sparklines do make it clear that the pace of battle was high during 1942 and 1944. It is difficult to assess the tempo of naval losses in 1945 from this chart because (1) 1945 was a partial year of WW2, and (2) the Imperial Japanese Navy was a shell of its former self during most of 1945.

There is nothing particularly special about this data or my analysis, but I thought I would make the spreadsheet available to others in case they were interested in looking at WW2 data – there are a fair number of discussion groups on this topic. Having the data in a spreadsheet makes it easy to work with.

Posted in History Through Spreadsheets | 7 Comments

Word of the Day: Omphaloskepsis


Quote of the Day

Mastery lives quietly atop a mountain of mistakes.

— Eric Greitens, in Resilience. This statement could not be more true. It is similar in spirit to a quote by the great physicist Niels Bohr, "An expert is a man who has made all the mistakes which can be made in a very narrow field."

Figure 1: Four People Contemplating Their Navels. (Source)


I was sitting in a management meeting today that seemed to be rather unproductive. It ended up in a philosophical discussion that did not go anywhere. I commented that we seemed to be engaging in omphaloskepsis, which is the name for the ancient Greek practice of contemplating one's navel (Figure 1). I first heard this word at Orbital ATK, where it was used to describe some of the meetings there.

No one else in today's meeting had heard the word before, but we all agreed that it bore some relationship to what occurred during our meeting. I thought I would share this word with you – some of you may sit in meetings like this.

Posted in Management | 2 Comments

Television Analogies to Working for a Startup


Quote of the Day

All governments suffer a recurring problem: Power attracts pathological personalities. It is not that power corrupts, but it is magnetic to the corruptible.

— Frank Herbert, Dune


Figure 1: Star Trek is analogous to a startup in the beginning. (Source)

I was talking to an old friend the other night about the positives and negatives associated with working for a startup company. Overall, we both enjoyed working with startups enormously, and I would seriously consider joining another. However, both of us understand the special challenges that startups face.

My friend compared being part of a startup to various well-known television shows, and I thought I would mention his thoughts here.

With a big smile on his face, he said that most startups go through three phases:

  • Beginning: Star Trek

    We are starting on a multi-year mission to boldly go where no one has gone before. Everyone is excited, you have a plan, and the sky is the limit.

  • Middle: Survivor

Steve Blank defines a startup as

    a temporary organization searching for a repeatable and scalable business model.

    This search for a business model often involves people with strong opinions who want to influence the direction of the company. These folks frequently try to form alliances within the organization – just like Survivor. This is often a stressful time.

  • End: ER

At some point, most startups start looking for a buyer/white knight/angel investor because they are having money problems. I have been there – there is nothing quite like having paying customers and no working capital to build product. There are days when you feel like you are in an ER and need someone to apply the paddles.

I had to laugh because his observations struck close to home.



Posted in Management | Leave a comment

Graphic Depicting the Need for Succession Planning


Quote of the Day

If men are to be precluded from offering their sentiments ... the freedom of speech may be taken away, and dumb and silent we may be led, like sheep to the slaughter.

— George Washington


Figure 1: NASA Workforce Age. (Source)


I recently had an employee in my group retire, which caused me to look at the age distribution within our entire HW organization. After seeing the ages of our engineering staff, I made a proposal to our management team for ensuring that the skills of our senior staff members are transferred over time to our junior staff members. This post shows how I presented the age information to internal management. The presentation was successful, and I thought it would be useful to show here.

The data presented here changes the names of the managers, and the ages themselves have been altered for privacy reasons. However, the overall message of the data is the same – nearly half of our engineers are 55 years old or older. These are also our most skilled employees. We need to begin working on transferring their skills to more junior staff. We are not the only company faced with the graying of its workforce – NASA has been working on the problem for a number of years (Figure 1).



At my company, I am not allowed to know the age of individual employees. However, I can request data on the individual ages within Hardware Engineering without names attached. I requested this data and was able to generate a plot and produce some useful tables, which I show below. Note that I have modified the names of managers and altered the exact age data, but kept the overall message the same.

I arbitrarily chose 55 years as the age at which we need to begin considering succession planning for an employee.


I analyzed the data using RStudio. The actual tables were generated using Excel because I like the look of Excel pivot tables. The raw source files are included here.


Employee Age Distribution

Figure 2 was the chart that generated the most discussion. It shows just how many engineers we have who are 55 or over. Note that some people are the same age, so I used ggplot2's jitter feature to show people of the same age as adjacent dots.
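The actual chart was made with ggplot2 in R; a Python sketch of the same jittering idea, using synthetic ages in place of the private employee data, looks like this:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

# Synthetic ages standing in for the (private) employee data.
rng = np.random.default_rng(1)
ages = rng.integers(25, 68, size=60)

# Jitter both coordinates a little so employees of the same age
# show up as adjacent dots instead of one overplotted point.
x = rng.uniform(-0.15, 0.15, size=ages.size)
y = ages + rng.uniform(-0.3, 0.3, size=ages.size)

fig, ax = plt.subplots()
ax.scatter(x, y)
ax.axhline(55, linestyle="dashed")  # the succession-planning threshold
ax.set_ylabel("Age (years)")
ax.set_xticks([])
fig.savefig("age_jitter.png")
```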


Figure 2: Employee Age Distribution.
I jittered the dots both horizontally and vertically to show employees of the same age.

Employee Age Percentages

After showing the chart above, I then presented Figure 3, which shows the percentage of each manager's employees in each age group (less than 55, and 55 and over).
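A Pandas sketch of this aggregation, using invented manager names and ages:

```python
import pandas as pd

# Illustrative data only; manager names and ages are invented.
staff = pd.DataFrame({
    "manager": ["Smith", "Smith", "Jones", "Jones", "Jones"],
    "age":     [58, 41, 62, 55, 30],
})

# Bucket each employee relative to the 55-year threshold.
staff["group"] = pd.cut(staff["age"], bins=[0, 54, 200],
                        labels=["< 55", "55 and over"]).astype(str)

# Head count per manager and age group, then convert to percentages.
counts = staff.pivot_table(index="manager", columns="group",
                           values="age", aggfunc="count", fill_value=0)
pct = 100 * counts.div(counts.sum(axis=1), axis=0)
print(pct)
```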


Figure 3: Pivot Tables Showing Employee Age Relative to 55 years.


With nearly half of our hardware engineers aged 55 or over, we have quite a bit of work ahead of us.

Posted in Management | 2 Comments

Optical Fiber Attenuation Specifications


Quote of the Day

Son, this is a Washington, D.C. kind of lie. It's when the other person knows you're lying, and also knows you know he knows.

Allen Drury, Advise and Consent



Figure 1: Plot of Fiber Attenuation Using Different Approaches.

I needed to estimate the loss on a fiber network today – something that I have done hundreds of times before. However, today was a bit different because I decided to look at how sensitive my results were to my assumptions on when the fiber was deployed. I was a bit surprised to see how much fiber has improved with respect to losses due to contamination by OH molecules, a problem often referred to as the water peak.

This post graphs fiber loss data (Figure 1) based on:

  • Corning SMF-28e fiber specification, a modern G.652-compliant fiber. This fiber has been around since the early 2000s.
  • empirical data from 1990 fiber deployments using G.652-compliant fiber
  • empirical data from 2000 fiber deployments using G.652-compliant fiber
  • empirical data from 2003 fiber deployments using G.652-compliant fiber
  • a common equation-based model.

I thought the results were interesting and worth sharing here (Figure 1).



attenuation (aka loss)
Attenuation in fiber optics, also known as transmission loss, is the reduction in intensity of the light beam (or signal) with respect to distance traveled through a transmission medium.  (Source)
attenuation coefficient
Optical power propagating in a fiber decreases exponentially with distance according to $P(z) = P_0 \cdot e^{-\alpha \cdot z}$, where P(z) is the optical power at distance z, P0 is the launch power, and α is the attenuation coefficient. We normally express the attenuation coefficient in terms of dB/km, which allows us to compute system losses using simple addition.
water peak
A peak in attenuation in optical fibers caused by contamination from hydroxyl (OH) ions that are residuals of the manufacturing process. Water peak causes wavelength attenuation and pulse dispersion in the region of 1383 nm.  (Source)

Loss Modeling

Fiber optic losses are modeled by assuming a fraction of the light power is lost in each component, with each loss expressed in dB. When expressed in dB, the losses can simply be added to provide a total loss. For more modeling information and an example, see this page.
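A small worked link budget illustrates the addition. The component loss values below are typical numbers I assume for illustration, not values from any specific deployment:

```python
# Component losses in dB simply add (values assumed for illustration).
losses_db = {
    "connectors (x2)": 2 * 0.5,    # 0.5 dB per connector
    "splices (x4)":    4 * 0.1,    # 0.1 dB per fusion splice
    "fiber (20 km)":   20 * 0.35,  # 0.35 dB/km at 1310 nm
}

total_db = sum(losses_db.values())
print(f"Total loss: {total_db:.2f} dB")

# The same total as a power ratio: a loss of L dB passes 10**(-L/10) of the power.
fraction_delivered = 10 ** (-total_db / 10)
```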


Corning Loss Model

I normally use the Corning loss model because virtually all my customers use Corning SMF-28e fiber. To assist customers with estimating fiber loss per km, Corning provides a spreadsheet (it contains a macro) with a simple model that uses the loss at a long wavelength, short wavelength, and at the water peak. For normal work, I use the values shown in Figure 2, which is from the SMF-28e specification sheet.


Figure 2: Typical Loss Values for SMF-28 Fiber.

I should mention that the attenuation of raw fiber is slightly lower than its attenuation after it has been put into a cable. This is because the act of cabling tends to add microstresses (also known as microbends) to the fiber that increase its attenuation. At my company, we assume that the cabling penalty is ~0.05 dB/km.

ITU-T G.652 Compliant Cable

The ITU has published a standard for optical fiber called ITU-T G.652. They have supplemented this standard with a document that contains measured data for cabled fiber. This data is interesting because it provides information I have never seen elsewhere:

  • fiber loss per km plus the standard deviation of the loss (i.e. variation across fiber segments). I often need to estimate the "worst-case" fiber loss and the standard deviation allows me to use the RSS method.
  • fiber loss per km for fiber installed during different years (1990, 2000, 2003). The fiber loss per km is different between the three years, particularly at the water peak. This data shows that fiber is greatly improved with respect to the water peak.
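The RSS worst-case estimate mentioned above can be sketched as follows. The segment means and standard deviations are invented, and the 3-sigma multiplier is my own choice; the key assumption is that the segment losses vary independently:

```python
import math

# (mean loss in dB, standard deviation in dB) per segment; values invented.
segments = [
    (3.0, 0.3),
    (5.0, 0.4),
    (2.0, 0.2),
]

# Means add directly; independent standard deviations combine root-sum-square.
mean_total = sum(m for m, _ in segments)
sigma_total = math.sqrt(sum(s**2 for _, s in segments))

# Worst case at roughly 3 sigma.
worst_case = mean_total + 3 * sigma_total
print(f"{mean_total:.1f} dB mean, {worst_case:.2f} dB worst case")
```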

I should mention that G.652 sets the minimum standard for fiber. Manufacturers often compete by having a better attenuation coefficient, ability to handle more power before the onset of nonlinearities (e.g. SMF-28e+), zero water peak, or supporting a tighter bend radius (e.g. bend insensitive fiber).

Equation-Based Loss Model

I occasionally see people model fiber loss (example) using Equation 1, which ignores the water peak. Equation 1 assumes that the attenuation versus wavelength is entirely due to Rayleigh scattering, which is accurate for wavelengths far from the water peak.

Eq. 1 \displaystyle \alpha \left( \lambda \right) = \frac{R_{\lambda_0}}{C_{\lambda_0}} \cdot \left( \left( \frac{1}{9.4 \cdot 10^{-4} \cdot \lambda} \right)^4 + 1.05 \right)


    • Rλ0 is the attenuation factor at the reference wavelength λ0.
    • Cλ0 is a constant that varies with the reference wavelength λ0.
    • λ is the wavelength at which I want to compute the attenuation factor.
    • α is the attenuation coefficient at λ.
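Equation 1 is easy to code. In the sketch below, I fix the constant C so that the model returns the reference attenuation at the reference wavelength; the 0.20 dB/km value at 1550 nm is an assumption of mine for illustration, not a Corning specification:

```python
def rayleigh_alpha(lam_nm, lam0_nm=1550.0, r0_db_per_km=0.20):
    """Equation 1: attenuation coefficient (dB/km) away from the water peak.

    C is chosen so the model returns r0 at the reference wavelength lam0.
    The 0.20 dB/km reference value is assumed for illustration.
    """
    shape = lambda lam: (1.0 / (9.4e-4 * lam)) ** 4 + 1.05  # Rayleigh 1/lambda^4 term
    c = shape(lam0_nm) / r0_db_per_km
    return shape(lam_nm) / c

# Shorter wavelengths scatter more, so 1310 nm shows higher loss than 1550 nm.
print(f"{rayleigh_alpha(1310):.3f} dB/km at 1310 nm")
```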


Fiber deployments generally avoid wavelengths near the water peak of 1383 nm because of the excess loss there. I thought it was interesting to see how much the water peak has changed over the last 17 years. Note that Corning's specification shows a water peak larger than was measured in the empirical data from the 2003 deployment. I assume this is because the 2003 measurements included some fiber with zero water peak, which would drop the average.

The Equation 1 model works well as long as you are far from the water peak. I will be using Equation 1 for my simple modeling tasks because:

  • None of my wavelengths are near the water peak.
  • Equation 1 is easy to evaluate in Excel, and all of my customers have Excel.
  • It is quite accurate for wavelengths far from the water peak.


Posted in optics | Leave a comment

Greek Mythology is Relevant to Engineering Management


Quote of the Day

Curiosity is, in great and generous minds, the first passion and the last.

— Samuel Johnson


Figure 1: Artist's Imagining of Cassandra. (Source)

The older I get, the more I see the relevance of the classics to modern life. As a boy, I read a children's version of Aesop's fables, which I loved and which remains relevant to daily life. Later in school, I read about Greek mythology in Mythology: Timeless Tales of Gods and Heroes by Edith Hamilton. I still have a personal copy of this book that I refer to occasionally. It may seem odd, but the more time I spend in engineering management, the more relevant these myths seem to become. Over the last two weeks, I have mentioned two Greek myths several times – the tales of Cassandra and Sisyphus. They seem particularly appropriate to modern management.

Cassandra (Figure 1) was a princess who was given the power of prophecy by Apollo, but when she spurned his advances, he cursed her so that no one would believe her. I frequently have engineers tell me that they warned someone about some hazard, but their warning went unheeded and the worst occurred. I often refer to these engineers as "Cassandras." All I can tell these frustrated souls is that their obligation is to warn their coworkers, but that ultimately their coworkers own their decisions. The most irritating response I have received after warning someone about a risk that was later realized is that I should have been more vehement in stopping them. I can only do so much …


Figure 2: Artist's Imagining of Sisyphus. (Source)

The other Greek myth that comes up often is that of Sisyphus (Figure 2), a clever king whom Zeus punished by making him endlessly roll a huge boulder up a steep hill. Just as the boulder was about to reach the top of the hill, it would somehow find a way to roll all the way back to the bottom, and Sisyphus would be forced to repeat his labor. Sisyphus has become a metaphor for any pointless activity that goes on forever. Unfortunately, many engineering projects have a phase where they seem interminable.

I can illustrate this point by recalling a large program at HP that had the code name "Touchstone," a metaphor for a product that would set a new standard for the industry. After it had gone on for a couple of years, engineers started to call it "Millstone," a reference to a Bible verse about a man thrown into the water with a millstone around his neck (Luke 17:2). Another year later, they were calling the program "Tombstone," recalling images of death. This is just how some programs go.

Posted in Management | 2 Comments

An Example of Cleaning Untidy Data with Tidyr


Quote of the Day

I wish to do something great and wonderful, but I must start by doing the little things like they were great and wonderful.

— Albert Einstein



Figure 1: P-61 Black Widow, the Most Expensive Army Air Force Fighter of WW2. (Source)

I recently decided to take some classes in data analysis at Datacamp, an online training site. My first classes were in dplyr and tidyr – two excellent R-based tools for working with files that are not amenable to analysis because of inconsistencies and structure: tidyr provides tools for reshaping messy data into tidy form, while dplyr provides tools for transforming and summarizing data. After completing the two classes, I decided that I needed to firm up my knowledge of these tools by applying them to a concrete example.

The gnarliest data that I know of is hosted on the Hyperwar web site. The data seems to be scanned and OCRed WW2 records that were translated into HTML. While a fantastic resource for history aficionados, these records are painful to analyze using software because of:

  • Numerous OCR errors
    The original source is often difficult to figure out.
  • Inconsistent expression of zeros
    I see blanks, zeros, periods, and dashes.
  • Mixing data and totals in seemingly random ways
  • Arithmetic errors (e.g. incorrect totals)
    I often think of some poor secretary/yeoman typing numbers all day and then having to use an adding machine to complete a total. It would have been miserable work.
  • Typing errors
    For example, a cargo ship listed at 70,006 tons during WW2 was most likely 7,006 tons. At that time, cargo ships were typically in the 10,000-ton range. This error actually occurred. I was able to confirm the error because the Wikipedia has a remarkable amount of data on WW2 ships, and it listed 7,006 tons for this ship.
  • Unbelievably complex headings (literally three and four levels)
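My cleanup was done in R, but the inconsistent-zero problem on this list is easy to illustrate in Pandas. The values below are invented; the point is mapping blanks, periods, and dashes to zero before converting to numbers:

```python
import pandas as pd

# A messy column in the style described above: blanks, periods, and
# dashes all standing in for zero (values invented for illustration).
raw = pd.Series(["7,006", "", ".", "-", "12,500"])

cleaned = (raw.replace({"": "0", ".": "0", "-": "0"})  # normalize the zeros
              .str.replace(",", "", regex=False)       # drop thousands separators
              .astype(int))
print(cleaned.tolist())
```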

This is not a complaint about the Hyperwar site itself. This is just a statement about the format and quality of WW2 records that were written by humans typing numbers at keyboards. For this post, I decided to download data on the cost of WW2 Army Air Force combat aircraft and clean it up for graphing. All of the heavy lifting is done in RStudio – only the Appendix is done in Excel because I like how it formats tables.


Problem Statement

I have read quite a bit about air operations during WW2, and I have often read that the P-47 and P-38 were very expensive fighters compared to the P-51. One reason given for the P-51 replacing the P-47 and P-38 in many applications was its cheaper unit cost. I decided that I would research the cost of these aircraft as a motivating application for practicing with tidyr and dplyr.


tidy data
Tidy data is a term coined by Hadley Wickham (creator of tidyr and dplyr) to describe data that is formatted with variables in columns and observations in rows, with no blank rows or columns. Missing data is clearly marked in a consistent manner, as are zeros. If you are a database person, you will see hints of his concepts in Codd's rules for relational databases.
untidy data
Untidy data is data that is not tidy – I hate definitions like this, but it works here. There are some common problems:

  • Values of variables used for column headings (e.g. column headings containing years – years should be a column with specific year values stored for each row).
  • Data that should be in multiple tables combined into one table. (e.g. describing a person's pets means multiple table rows for people with more than one pet – the pet details should be in a separate table.)
  • Aggregate data (e.g. totals) in the rows.
  • Blank rows or columns.
  • Single observation scattered across multiple rows.
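The first problem on this list (years used as column headings) is the classic case for a gather/melt operation. My actual cleanup uses tidyr in R, but the same reshaping in Pandas looks like this, with invented aircraft costs:

```python
import pandas as pd

# Untidy: years as column headings (aircraft and costs invented).
untidy = pd.DataFrame({
    "aircraft": ["P-51", "P-47"],
    "1943":     [58000, 105000],
    "1944":     [51000, 85000],
})

# Tidy: one row per (aircraft, year) observation.
tidy = untidy.melt(id_vars="aircraft", var_name="year", value_name="unit_cost")
tidy["year"] = tidy["year"].astype(int)
print(tidy)
```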


All costs are expressed in WW2 dollars. The point of this exercise is to provide a worked Rmarkdown-based example of cleaning an untidy data file. Converting WW2 dollars to modern dollars is fraught with issues, which I leave to others.

Rmarkdown File

The Rmarkdown file is a bit long, so I include a PDF of it here as a link. For the complete source, see this zip file. The zip file contains the raw Hyperwar tab-separated file, a cleaned-up .csv, Rmarkdown source, and a PDF of the processed Rmarkdown document.


I am not interested in plotting all the data. There are just a few WW2 aircraft in which I am interested. I plotted the costs of my favorite WW2 fighters and bombers versus time. Note that some of the costs dropped dramatically over time. This is true for any product (example).

Fighter Plane Costs Versus Time

Figure 2 shows how the unit costs changed with time for my favorite WW2 fighter planes. Note that the P-38 and P-47 are quite expensive relative to the P-51. The most expensive US fighter of WW2 was the P-61 night fighter (Figure 1).


Figure 2: WW2 Army Air Force Fighter Plane Costs Versus Time.

Bomber Plane Costs Versus Time

Figure 3 shows how the costs changed with time for my favorite WW2 bombers. Notice that the B-29 was ~2.6x more expensive than the B-17. However, only the B-29 had the range needed to bomb Japan from the Marianas Islands.


Figure 3: WW2 Army Air Force Bomber Costs Versus Time.

Cleaned-Up CSV File

The Rmarkdown file outputs a cleaned-up version of the data as a .csv file. I include that file in with my source. You can see a tabular version of the data in Appendix A.


I work in corporate America, where Excel is the standard for processing data. While Excel has some strengths, it is not the most powerful data analysis tool available. RStudio provides a wonderful data analysis and presentation environment. This post provides a fully worked example of how to use RStudio with tidyr and dplyr to tidy up some very messy data.

Appendix A: Hyperwar Aircraft Cost Table Reconstruction.

Figure 4 shows how my tidy data version of the cost data can easily be converted back into the original form.


Figure 4: Import of Cleaned Table Into Excel for Table Presentation.

Posted in Excel, History Through Spreadsheets | 2 Comments