WW2 Tank Production Comparison Between Combatants


Quote of the Day

At least you figured it out. My first two husbands never did.

— Female employee told me this after I mentioned that my wife's nickname for me was "The Paycheck" – this nickname was a joke between my wife and me.


Figure 1: Soviet T34/85 Medium Tank.

Figure 1: Soviet T34/85 Medium Tank. (Source)

I occasionally watch WW2 documentaries on television, and one of these programs provided a bit of history on the Battle of Kursk. This battle may have been the greatest clash of armored forces in history. The coverage highlighted the role of medium tanks, like the Soviet T34 (Figure 1) and the German Panzer Mk V (aka Panther [Figure 2]), and heavy tanks, like the German Panzer Mk VI (aka Tiger I). One of the themes of the documentary was the massive size of Soviet armored forces compared to their German opponents. This discussion piqued my curiosity about the number of tanks produced by the major combatants.

I performed a quick web search and soon found that the Wikipedia has data for the Soviet, German, and American armor production in an odd, but usable form. The articles had all the data I wanted but in separate tables that would need to be cleaned up separately, consolidated, and then turned into pivot tables. I very much wanted to include UK tank data, but what I found had a weird format and decided to skip including it.


Points of Interest

Figure 2: Panzer_Mk_V_Panther

Figure 2: German Panzer Mk V – Panther Tank. (Source)

I am interested in three things:

  • Total tank production by all parties separately.
  • Total tank production of Allies versus Germans.
  • Weight mix of tanks produced (light, medium, and heavy).

My interest in the mix of tank weights is driven by the first-person accounts I have read from US veterans that show great concern about encountering heavy tanks, like Tiger Is and Tiger IIs. In the case of the US, the armor philosophy was focused on medium tanks, specifically the M4 Sherman. The lone US heavy tank, the M26 Pershing, was not deployed until late in the war. This decision is a major topic of discussion even today with some first-person accounts referring to the Sherman tanks as a deathtrap. Opposing arguments are that the Sherman was not that bad (particularly the Easy 8 and Firefly variants) and that quantity has a quality all its own. There is some empirical support for this position.

  • The Sherman Easy 6 performed well against the Soviet T34/85 in Korea.
  • Shermans performed well at the Battle of Arracourt against German forces with a large number of Panthers.
  • The Soviets used a large number of Shermans and all reports I have seen indicate that they liked them.

I have to mention that making direct comparisons of armor effectiveness is difficult because  training and operational tactics are critical to achieving good field performance.

Tabular Presentation

Using Excel and Power Query, I grabbed data from three Wikipedia pages: US armor production, Soviet armor production, and German armor production. The data includes both tanks and self-propelled guns. I consolidated the data into a single pivot table (Figure 3).

Figure 3: German, Soviet, and US Armor Production in WW2.

Figure 3: German, Soviet, and US Armor Production in WW2.

While the table contains all the data available, it is a difficult format for people to interpret quickly. We need to find a graphical way to present the data.

Graphical Presentation

Figure 4 shows the relative armor production and mix of the German versus the Allies. The Allies produced more than four times more armor than the Germans did. Also note the amount of heavy armor produced by the Allies.

Figure 3: German vs Allied Armor Production Numbers.

Figure 4: German vs Allied Armor Production Numbers.

Figure 5 shows how US and Soviet production compared. The Soviets outproduced the US in total numbers, particularly in heavy armor.

Figure 4: Soviet vs US Armor Production.

Figure 5: Soviet vs US Armor Production.


Here is how I interpret this data:

  • The US and Soviets outproduced the Germans in tanks by more than a factor of 4.
  • The Soviets outproduced the US in armored vehicles by almost 20%.
  • The Soviets produced significantly more heavy tanks than both the US and Germany.
  • While quite a bit has been written about the German Tiger tanks, relatively few of them (less than 2000) were produced.

For those who are interested, my workbook is here.

Posted in Excel, History Through Spreadsheets, Military History | Leave a comment

Retired Hurricane Name Statistics


Quote of the Day

The cost of a thing is the amount of what I will call life which is required to be exchanged for it, immediately or in the long run.

— Henry David Thoreau, Walden, Chapter 1: Economy

Figure 1: Retired Atlantic Hurricane Names.

Figure 1: Retired Atlantic Hurricane Names.

I was reading an article that stated that names of Hurricane Harvey and Irma could be the eighth pair of back-to-back hurricane whose names have been retired since the 1954. I have never thought about which hurricane names have been retired, so I started to search around the web. As usual, the Wikipedia provided an excellent data source.

The modern hurricane naming system began in 1954 and the first name retirements occurred in 1955. Figure 1 shows that 83 Atlantic hurricane names have been retired as of the 2016. There are some famous names on this list. These names have a visceral response on some people. For example, my father-in-law used to shudder whenever he heard the name Camille.

You can see the kind of destruction that these storm cause by looking at the 10 retired hurricane names with the greatest loss of life (Figure 2a) and repair costs (Figure 2b).

Figure 2(a): Ten Retired Hurricane Names with the Greatest Loss of Life.
Figure 2(b): Ten Retired Hurricane Names with the Greatest Monetary Losses.

It also is interesting to look at the frequency with which hurricane names are retired. In Figure 3, I show the list down into 5 year periods that display the total number of names retired and total damage caused. You can see that the period from 1995 through 2009 was rough.

Figure 3: Hurricane Names Retired, Total Damage, and Loss of Life in Five-Year Periods.

For those who are interested, my workbook is available here.



Posted in History Through Spreadsheets | Leave a comment

Examples of Lake Clarity Variation Over Time


Quote of the Day

Put down everything that comes into your head and then you’re a writer. But an author is one who can judge his own stuff’s worth, without pity, and destroy most of it.

— Sidonie-Gabrielle Colett


Figure 1: Pattern on Freshwater Secchi Disk.

Figure 1: Pattern on Freshwater Secchi Disk. (Source)

I am having my cabin built on a small lake in northern Minnesota. At the same time that the cabin is being built, a friend is in the process of locating an an existing cabin for purchase on a nearby lake – there are dozens of lakes within a few miles of my building site. He has been asking questions about the clarity of the water in these lakes. Fortunately, the state of Minnesota has an excellent web page with all sorts of technical data on lake water, including clarity measurements. Professional lake monitors are also used. On a regular basis, they gather technical information on the lakes: chemistry, fish populations, presence of invasive species, etc. The lake water clarity data historically has been measured using a Secchi disk (Figure 1) and volunteer lake monitors. In recent years, satellites have been tasked with clarity monitoring as well. So there are now two sources of lake clarity information that can be used to cross-check one another.

Lake clarity is important because it gives you an idea as to the health of the lake, and the lake's health has a major influence on the value of your property. For me, a healthy lake supports a significant fish population, has no algae blooms, and has no significant weed issues, etc. However, low clarity does not necessarily mean an unhealthy lake. For example, a bog is the source of water for my lake. This gives the water a tea-color (referred to as "bog stain"), which reduces its clarity. However, the lake itself is quite healthy. In summary, clarity is just one of a number of useful measures of lake health.

In this post, I will be focusing on the Secchi disk data gathered by the volunteer lake monitors over the years on three lakes in Itasca County. The data acquisition process is straightforward:

  • lower the Secchi disk into the water.
  • record the depth at which you lose sight of the disk.
  • send the data to the Minnesota Department of Natural Resources (MnDNR).
  • For each lake, the MnDNR generates a CSV file containing data from both their volunteer program and their professional lake monitors. The file can be downloaded via the Internet, which is how I have configured Excel.

Lake clarity is just one parameter that volunteers gather. Another important parameter is lake level. It turns out that lake level and clarity are somewhat related. If you have have a substantial amount of rain (i.e. very clean water), the lake clarity will improve and your lake level will also rise. One issue with the volunteer program is that there sometimes are gaps in the data because volunteers are not always reliable data gatherers. Overall, I have to say the data gathering program is a great success.

For those who are interested, my workbook is available here.


My friend is strongly considering the purchase of a cabin on Johnson Lake, and I want to help him gather information for making an intelligent choice. So I grabbed the CSV files for Johnson Lake and two other nearby lakes, Eagle and Ruby, so he would have a basis for comparison.

Here is how I view these lakes.

  • Ruby Lake

    Ruby is sometimes called a "gem" of a lake because of its water quality and the small number of cabins on the lake. When I first saw Ruby Lake 20 years ago, I was very impressed with its clarity – you could clearly see the bottom. It is also has limited boat traffic because there are not many cabins on the lake. One reasons for the small number of cabins is because Ruby Lakes abutts a national forest, which means that part of the lake will never have cabins on it.

  • Johnson Lake

    Johnson is a good general use lake. The lake is near a major county road,which means access is easy in both winter and summer. Of course, this has the downside of being somewhat noisy. It has a good-sized gamefish population for those who like to fish and is large enough for those who like watersports, like waterskiing.

  • Eagle Lake

    Eagle is a small lake that is bog-stained. One side of the lake can be accessed by a major county road (i.e. easy access) and the other side of the lake has relatively poor access. It is not considered a major fishing lake, though kids love to fish on it because it has large numbers of sunfish and crappies. However, this is nice for those of us who do not fish and like our lakes quiet (i.e. me). It is large enough for watersports, but most folks just cruise around on their pontoon boats. There are relatively few cabins on the lake because much of the shoreline is owned by a few families.

Figure 2 shows the Secchi disk data for all three lakes. We can make several observations about this data:

  • Johnson Lake (green rectangles in Figure 2) has had very stable lake clarity for a long period of time. This indicates that the lake's health is stable.
  • Ruby Lake (blue triangles in Figure 2) had very good clarity back in the early '90s. No one gathered Secchi data during the early 2000s, but recent data shows that the clarity has declined down to the level of Johnson Lake. I would like to know more about why this change has occurred.
  • Eagle Lake (red circles in Figure 2) has very limited data. The data we have shows that the clarity can be low but is highly variable. This data seems to correlate with Eagle's Lakes level. Eagle has a variable level because its outflow is regularly blocked by beaver dams.
Figure 2: Excel Chart of Clarity of Three Lakes Over Time.

Figure 2: Excel Chart of Clarity of Three Lakes Over Time.


I am not sure if my friend will buy a cabin on Johnson Lake. The data shows that the lake is stable and not undergoing any stress.

Just for fun, I also plotted the data using ggplot2 (Figure 3).

Figure 3: Same Data as Figure 2 Plotted Using ggplot2 and Default Settings.

Figure 3: Same Data as Figure 2 Plotted Using ggplot2 and Default Settings.

Posted in Excel, General Science | Leave a comment

Hurricanes and Typhoons Ranked By Air Pressure


Quote of the Day

Whenever you are about to find fault with someone, ask yourself the following question: What fault of mine most nearly resembles the one I am about to criticize?"

Marcus Aurelius, Meditations

`Figure 1: Wind Speed and Air Pressure Distribution in a Hurricane or Typhoon. (Source)

Figure 1: Wind Speed and Air Pressure Distribution in a Hurricane or Typhoon. (Source)

I was watching the weather reports on Hurricane Irma and the discussions on how powerful it is. The most cited metrics for hurricanes and typhoons appears to be wind speed and eye barometric pressure (see Figure 1). I decided to look around for hurricane and typhoon strength data and the Wikipedia turned out to have a page containing large number of tables for all the most intense typhoons and hurricanes in different regions of the world. I used Power Query to (1) import the tables, (2) clean them up, (3) combine them, and (4) rank the storms by air pressure.

I have only personally experienced a weather event like Hurricane Irma once in my life. I was traveling in Taiwan when Typhoon Matmo struck. I was very impressed with the people of Taiwan and how organized they were on handling this event. The only unnerving part of the whole thing was that I was about to take a flight back to the US when a plane crashed at a nearby airport. This made me a bit queasy about how safe it was to fly back. It turned out the crash was because of pilot error.

For those who are interested, my workbook is here. Figure 2 shows the most intense hurricanes in the Atlantic basin when ranked by central barometric pressure.

Figure 2: List of the Most Intense North Atlantic Hurricanes Ranked By Barometric Pressure.

Figure 2: List of the Most Intense North Atlantic Hurricanes Ranked By Barometric Pressure.

I was also able to generate a list of the top 25 typhoons and hurricanes by air pressure (Figure 3). Notice how the Pacific has had the most intense storms, at least with respect to barometric pressure. Hurricane Patricia was interesting because it was a very intense hurricane that struck Mexico from its western side (see Figure 4).

Figure 3: Worldwide Ranking of Hurricanes and Typhoons by Air Pressure.

Figure 3: Worldwide Ranking of Hurricanes and Typhoons by Air Pressure.

Figure 3: Path and Intensity of Hurricane Patricia.

Figure 4: Path and Intensity of Hurricane Patricia. (Source)


Posted in Excel, History Through Spreadsheets | Leave a comment

Baby Name Frequency Ranking Using Power Query


Quote of the Day

In defeat unbeatable, in victory unbearable.

Churchill on General Montgomery, a man who was not popular with those he worked with and for.

Figure 1: Most Popular Baby Names of 2016.

Figure 1: Most Popular Baby Names of 2016 based on SSA data. You can verify my 2016 list by looking at the SSA web site.

You may have noticed that many of my recent posts are focused on data processing and analysis. These recent posts reflect the fact that I am preparing for a career change as I near "retirement," and I plan on working in the data analysis arena. This means that I have been in serious Python, R, and statistics training. While I love working with Python and R, I keep finding myself drawn back to Excel and Power Query (aka Get and Transform) for quick, ad hoc analysis work. While this blog will look at the frequency over time of the name Mark, you can use the tool to generate the same chart for any name.

In this post, I am going to use Social Security Administration (SSA) data on baby names, which dates back to 1880 – I assume the original data came from the Census Bureau because the SSA did not exist in 1880. The data is in the form of text files, one file for each year, that contain the number of babies given each name during that year. The SSA did not list names that occurred less than five times in a year. As an example of the information that can be extracted, Figure 1 shows the ten most popular names (male and female) in 2016.

I was inspired to look for this data because I am interested in the historical popularity of the name Mark (my name), which I became curious about because virtually every Mark I meet has an age from 50 to 60 years. Given that observation, it seems that I should see Mark being a very popular name during the late 1950s through the 1960s. I will show that the data supports this observation. I will present the results graphically. I also plotted data for other names in my family, but decided not to post that data for privacy reasons.

This is a really good data set to use for a Power Query example because:

  • The data consists of many text files in the same format that need to be combined.
  • Each text file name includes the year of the data, which I must extract and put into a column within the data.
  • There are well over a million lines of data, which exceeds the number of rows allowed in Excel. Power Query and Power Pivot allow these large data sets to be processed within the Excel environment. You can present the data using pivot tables, which allow you to aggregate the information down to a reasonable size.
  • I was able to use of an m-code routine to rank the names. I was inspired to use this approach by Chris Webb's excellent blog post on the topic.
  • I was able to demonstrate how to make the workbook relocatable by using sending Power Query the file address as a parameter.

Figure 2 show that the name Mark was quite popular (6th or 7th ranked) from 1957 to 1968. So my anecdotal evidence is supported by the data.

Figure 2: Popularity of the Name Mark in the US over Time.

Figure 2: Popularity of the Name Mark in the US over Time.

My workbook and the associated data is located here – the zip file is quite large (22.5 MB). You can generate Figure 2 for any name by inputting the name and gender (M/F), then pressing the update chart button. You can use it to check the popularity of any name over time. As an example, Figure 3 shows the popularity of the name Peter over time.

Figure 3: Popularity of the Name Peter Versus Time.

Figure 3: Popularity of the Name Peter Versus Time.

The workbook contains a single line macro that updates the query.






Posted in History Through Spreadsheets, Personal | Leave a comment

Largest Counties in the US


Quote of the Day

I don’t see why in Hell Roosevelt didn’t order Wainwright home and let MacArthur be a martyr … We’d have had a real General and a fighting man if we had Wainwright and not a play actor and bunco man such as we have now. Don’t see how a country can produce men as Robert E. Lee, John J. Pershing, Eisenhower, and Bradley and at the same time produce Custers, Pattons, and MacArthurs.

Harry S. Truman, Diary Entry – June 17, 1945. I have read everything I could find on MacArthur. I will never understand how a man like that rises to power.

Figure 1: Itasca County in Northern Minnesota.

Figure 1: Itasca County in Northern Minnesota.

I am currently building a retirement home in Itasca County, Minnesota. While at the construction site, I spoke with a sheriff's deputy about law enforcement coverage for my area. He told me that there were only two deputies on duty at night covering an area larger than the state of Delaware. He also said that if the deputies are tied up in the southern part of the county it can take a while for them to get to my place, which is in the the northern part of the county.  I had never thought about the size of the counties in northern Minnesota, but there are some large pieces of real estate there.

I became curious about the size of the largest counties (and county-like entities) in the United States and how they compare in size to the various states. I found county size data on the Census Bureau web site and state size data on the Wikipedia. It turns out that while some of the counties in Minnesota are large, they are dwarfed by the size of counties in the western states – Alaska in particular has some monster counties. Itasca County is 189th largest in the US. So while I think of it as big, there are many larger counties. For those who are interested, my analysis is in this workbook.

Figure 2 shows the states closest in area to the 25 largest counties by total area (i.e. land and water).

Figure 2: List of 25 Largest Counties by Total Area and the State Nearest in Area.

Figure 2: List of 25 Largest Counties by Total Area and the State Nearest in Area.

Figure 3 shows how many states are smaller than these counties. For an international comparison,  Yukon-Koyukuk, AK is larger in area than Germany (137,983 mi²). Figure 4 shows the size of Yukon-Koyukuk relative to the rest of Alaska.

Figure 3: Number Of States Smaller Than Specific Counties.

Figure 3: Number Of States Smaller Than Specific Counties.

Figure 4: Yukon-Koyukok Borough in Alaska.

Figure 4: Yukon-Koyukuk Borough in Alaska. (Source)

Posted in Excel, Personal | Leave a comment

Thermal Calculation Example for a Simple Electronic Component


Quote of the Day

I feel like a broken record talking to a brick wall.

— Engineer tired of telling executive management the same thing over and over.


Figure 1: Commutation diode used for switching power supply.

Figure 1: Commutation diode used for a switch-mode power supply. (Source)

An circuit designer came to me yesterday with an interesting problem whose solution nicely illustrates how simple component thermal calculations are performed. He was seeking advice on calculating the junction temperature of a Schottky diode (Figure 1) used in a switched-mode power supply.

At our company, the component temperature calculations are done by a mechanical engineer rather than by the circuit designer. Junction temperature calculations performed by the mechanical engineer showed that the diode was operating way above its allowed maximum temperature. This means that the circuit designer was being told to add a costly heat sink and make time-consuming Printed Circuit Board (PCB) design changes. We have not needed to take these actions for similar circuits in the past, and he was wondering if there could have been a modeling issue.

In this post, I will document my analysis of the diode's temperature. It turns out that the diode's temperature is within the allowed limits and there was a modeling issue – the wrong thermal resistance value was used. It was just a simple miscommunication that resulted in unnecessary engineering actions being requested.



All the term definitions (with one exception) used for this post are contained in two previous posts:

A previously unmentioned thermal resistance, θJL, is used in the following analysis. θJL represents the thermal resistance between the diode junction and its leads. This type of thermal resistance specification is common for power components.

A Modeling Warning

The modeling error that initiated this additional calculation was the inappropriate use of θJA. I almost never use θJA directly for determining junction temperature. I will quote from the JEDEC standard on its recommendation for the use of θJA – the standard uses the term "Theta-JA."

The intent of Theta-JA measurements is solely for a thermal performance comparison of one package to another in a standardized environment. This methodology is not meant to and will not predict the performance of a package in an application-specific environment.

Ignoring this statement was at the core of the modeling error.

Difference Between Case Temperature and Lead Temperature

Figure 2 shows where the lead temperature is to be measured. Power components often have a large copper "slug" – also known as the lead  – that is intended to provide an excellent thermal conduction path to the PCB.

Figure 2: Illustration showing how lead temperature is defined.

Figure 2: Illustration showing how lead temperature is defined. (Source)

Unfortunately,  case temperature measurements are standard practice because it is easier to attach a thermocouple to the case than to a tiny tab.


Junction Temperature Estimate Using θJL

The key formula for this calculation is Equation 1.

Eq. 1 \displaystyle {{T}_{J}}={{T}_{B}}+{{P}_{{Diode}}}\cdot {{\theta }_{{JL}}}


  • TB is the PCB temperature. I normally assume that TB is 10 °C above the air ambient temperature of 70 °C – set by industry standard.
  • PDiode is the power dissipated in the diode, which is 2.5 W in this application.
  • θJL is thermal resistance from Junction-to-Lead, which is 3 °C/W for this diode.
  • TJ is the diode junction temperature. By company convention, we normally limit TJ to no higher than 110 °C. This convention dates back to my days as a Navy contractor when we were required to conform to the Willoughby reliability guidelines, which strongly recommended keeping junction temperature under 110 °C. I should mention that the part is designed to function at temperatures as high as 150 °C. We would never run a part this hot because the lifetime is reduced at this temperature below our customer's expectations (15 years). If you are curious about modeling component failure rate with respect to temperature, see this post on laser reliability modeling.

Figure 3 shows the computation details.

Figure 3: Computing the TJ Using ThetaJL.

Figure 3: Computing the TJ Using θJL.

I estimate the junction temperature to be running near 88 °C. This compares well with testing done with similar circuits on previous products.

Junction Temperature Estimate Using Estimated ψJT

Just to provide folks an alternative approach to the calculation, there is a parameter called ψJT that is actually more appropriate for this calculation. We can use Equation 2 and ψJT to compute TJ. I derive this formula in gruesome detail in this blog post.

Eq. 2 \displaystyle {{T}_{J}}={{T}_{T}}+{{P}_{{Diode}}}\cdot {{\psi }_{{JT}}}


  • TT is the temperature measured at the top of the diode case. In this situation, I measured 88 °C.
  • ψJT is thermal resistance from Junction-to-Top of case.

Unfortunately, the diode vendor does not provide ψJT. However, there Equation 3 can be used to estimate ψJT sufficiently well for many applications.

Eq. 3 \displaystyle {{\psi }_{{JT}}}=h\cdot {{\theta }_{{JA}}}\cdot \frac{{{{\tau }_{{EMC}}}}}{{{{\kappa }_{{EMC}}}}}


  • h is the heat transfer coefficient of air.
  • τEMC is the thickness of the part's Epoxy Molding Compound (EMC).
  • κEMC is the thermal conductivity of the part's epoxy molding compound.
  • ψJT is thermal characteristic from Junction-to-Top of case.
  • θJA is the thermal resistance from Junction-to-Ambient.

Figure 4 shows how I computed ψJT and used it to compute TJ.

Figure 4: Junction Temperature calculation using Psi-JT estimate.

Figure 4: Junction Temperature calculation using ψJT estimate.

The use of the approximate method also yielded a junction temperature estimate of ~90 °C. The erroneous calculation using θJA produced a temperature estimate that was over 140 °C.


While going through my calculation, I saw where the error occurred in the other engineer's analysis. I then sent him a copy of my work, and we avoided making a costly change to a system. I posted this example because others might find it useful.

I should mention that this is a simple one-resistor analysis that is not particularly accurate – the answer should be viewed as rough. When more accurate answers are required, engineers need to turn to Delphi models (i.e. multi-resistor) or computational fluid dynamics software.

Posted in Electronics | 1 Comment

US Navy Warship Mix


Quote of the Day

The ability to focus is like a mental muscle. The more we work it out, the stronger it becomes …

— Daniel Goleman

Figure 1: USS John S. McCain Prior to Collision.

Figure 1: USS John S. McCain Prior to Collision. (Source)

While driving home after watching the eclipse, I listened to the news talking about a collision between a US Navy destroyer, the USS John S. McCain (Figure 1), and a commercial tanker.  A previous collision also involved a destroyer, the USS Fitzgerald. These stories have made me curious about how many of these ships the US Navy has.

The Wikipedia has an excellent list of US Navy ships, both commissioned and non-commissioned (i.e. sealift).  I used Excel with Get and Transform (also known as Power Query) to download, tidy, and process this table. For those who are interested, my workbook is available here

Destroyers are the most common type of commissioned warship in the US Navy (Figure 1). Note that I filtered the USS Pueblo and USS Constitution from the official list because the USS Constitution is a museum ship and the USS Pueblo is held by North Korea.

Figure 2: US Navy Warship Summary Table.

Figure 2: US Navy Warship Summary Table.

The US Navy has two classes of destroyers: Arleigh Burke (64 ships) and Zumwalt (1 ship). The Zumwal class has proven to be extremely expensive and only three will be built – two are still under construction.

The USS John S. McCain was commissioned in 1994, which means it is 23 years old. I became curious as to just how old US Navy ships are. Figure 3 answers that question – the McCain was commission during the peak years for US Navy ship commissioning. I was surprised to see that so many ships were built during the 1990s.

Figure 2: US Navy Warship Commissioning Dates.

Figure 3: US Navy Warship Commissioning Dates.

There is some data on the history of US Navy ship collisions. Statista has published an excellent chart (Figure 4).

Figure 4: Recent US Navy Ship Collision.

Figure 4: Recent US Navy Ship Collisions. (Source)

Posted in Excel, Military History | Leave a comment

Using Excel to Plot Year of Dedication and Locations of Confederate Monuments


Quote of the Day

Half of what you learn in medical school is going to be shown to be either dead wrong or out-of-date within five years. Trouble is, nobody can tell you which half.

— Dr. David Sackett, pioneer of evidence-based medicine. Everyone who works in the technology field understands this problem. I often compare our engineering knowledge to milk – it has a shelf life.

Figure 1: Monument to the First Minnesota at Gettysburg.

Figure 1: Monument to the First Minnesota at Gettysburg. (Source)

I have been listening to the controversy surrounding the Confederate monuments around the United States. I live in Minnesota, a Union State, and we have no Confederate monuments in the state. Minnesota did send troops to the Civil War and they performed well (Figure 1).

I heard some discussions on television about all the Confederate monuments around the country and when they were erected. I decided to look for the data and plot it for myself. I very quickly found a document from the Southern Poverty Law Center that looked interesting and provided me some interesting data tidying and charting challenges. My focus here is on duplicating their chart of monuments dedications dates. This chart type is not a standard Excel type and I wanted to see how I could duplicate it. This workbook will be used in a charting seminar that I plan to present in a month or so.

As I looked at the PDF source document, I suspected that it had been originally written in Microsoft Word. To cleanly copy the data into Excel, I use a three-step process to avoid a messy transfer:

  • copy the table from the original PDF document.
  • paste into Word and re-copy from Word.
  • paste into Excel.

This produced data that I could clean-up using Get and Transform (also known as Power Query).  Once cleaned up, I could use the data to look at which states had monuments (Figure 2). You can see that there are 1503 Confederate monuments listed in the data, which does not include approximately 2,570 Civil War battlefields, markers, plaques, cemeteries and similar symbols that commemorate historical events.

While the original data listed 35 location types, the original document formed three groups for plotting: (1) monuments at schools, (2) monuments at courthouses, and (3) other. I followed their lead.

Figure 2: Confederate Monument Summary Table.

Figure 2: Confederate Monument Summary Table.

Unfortunately, the original document only has dedication dates for 856 of the 1503 monuments. While this is a serious shortcoming, this is the data that they plotted, and I will do the same.

I plotted the dedication dates using the format used in the original document, which is not a standard Excel chart type. However, I was able to make a plot that looks very similar to the original by using some formulas in cells. Please see my workbook here for details.

Figure 3: Plot of Confederate Monuments By Year of Dedication and Location.

Figure 3: Plot of Confederate Monuments By Year of Dedication and Location.

Note that the original table added callouts for various important civil rights events. I decided that I wanted to keep the chart simple and did not include these in Figure 3.

There are some observations that we can make about Figure 3:

  • Confederate monuments began being dedicated before the Civil War was even finished.
  • There was a large surge in monument dedications around 1910.
  • There was smaller surge of monuments dedicated during the 1960s. The only unusual thing about these dedications is that many were at schools. Previous dedications were only rarely at schools.


Posted in History Through Spreadsheets, News Fact Checking | 1 Comment

Change in US House Square Footage Over Time


Quote of the Day

You always have prior information before you do an experiment, because something motivated you to do the experiment.

— Data Science Facts twitter feed. I view this statement as support for the Bayesian approach.

Figure 1: New House Square Footage in the US vs Time.

Figure 1: New US Housing Square Footage vs Year.

My wife and I are about half-way through the construction of a 2100 square foot home in northern Minnesota. This weekend, my neighbors and I were talking about the area of houses being built today, and no one in the conversation had any data. I grabbed my computer, jumped on the Internet, and very quickly found data from the Census Department that answered my question. Like most census information, the data is in the form of screwy tables that need to be parsed to get into a form that can easily be plotted. This exercise gave me another excellent example to use when I train staff on the use of Excel's Get and Transform tool. Figure 1 was the result of my search. For those who are interested, my workbook is here.

Figure 1 provides some interesting information:

  • The size of new US housing is rising over time.
  • The Great Recession (2008) saw a minor dip in the trend of larger housing.
  • House sizes are now larger than before the Great Recession.
  • My new house is considered mid-range in size.

The Census Department actually breaks the data down by region of the US: West, Midwest, South, and Northeast. Figure 2 shows a panel chart with all the data. The biggest home are being built in the Northeast. The smallest homes are being built in the Midwest, where I live.

Figure 2: New US Housing Square Footage By Region and Year.

Figure 2: New US Housing Square Footage vs Region and Year.


Posted in Construction | 1 Comment