Quote of the Day
Their drills are bloodless battles, and their battles bloody drills.
— Publius Flavius Vegetius Renatus, Roman writer on the Roman military. An older version of "train like you fight, fight like you train."
In this post, I will use Power Query (PQ) to gather (aka web scrape) the US Medal of Honor (MOH) recipient names, rank, service, and conflict from a website called the Congressional Medal of Honor Society (CMOHS), which states that there have been 3505 MOH recipients. They have a beautiful website with the records stored on 140 ages with 25 names per page and one page with 5 names. Normally, I would use Python to scrape a large number of web pages but I am trying to use PQ more because my customers all have Excel. My approach as a contractor is always to work within the existing infrastructure of my customers. A copy of my workbook is available here.
I should mention that the Wikipedia states there have been 3522 MOH recipients. I do not know the reason for the discrepancy, but I do know that I gathered every MOH recipient from the CMOHS pages.
The web scraping process was straightforward:
- Create a list of all web pages with MOH recipients.
This was easier than you might think. There were 141 pages, each with the base URL and an appended number (1 through 141). You can see the format in the attached workbook. - Build a parsing template by parsing one webpage
This was a little harder than you might think. PQ could not determine how to identify the information on the web page because it was not table-oriented. However, I took a standard web query and modified it by converting a WebPage command into a WebContents command – it loaded in the raw HTML. I then parsed the HTML to extract the information. I illustrate the process in Figure 2, a figure which is large and you must click on it to view. The HTML parsing was routine and I just refer you to the query to see how I did it. - Turn the single-page parsing routine into a function
This is a routine process. You add a function header (aka signature) and a closing phrase. I illustrate the process in Figure 3. - Load the list of web page addresses into PQ
Make the list of web page addresses into a table and load it into PQ. - Run the function over every web address, producing a column of tables.
I used the "Invoke Custom Function" command from the menu to create a new column with the parsed data. I illustrate the process in Figure 4. - Expand the column of tables into its separate fields.
You now have all the MOH recipients. At this point, I can begin filtering and pivoting.
I know it sounds complicated but it is simple when you see it in the spreadsheet.
Now that we have the data, let's see what we can learn. I decided to focus on WW2 and later conflicts. I wanted to know the number of MOH recipients by:
- conflict
- service
- rank
I now could generate the pivot tables that I wanted (Figures 5 through 7).
The results make sense:
- WW2 was such a massive conflict that the number of recipients should be larger.
- The US Army is the largest of the US military services, so it would make sense for it to have the most recipients.
- Enlisted ranks would be more likely to be in the life and death situations that would call for such acts of bravery.
Fig 5 & 6 have the same description?
Thank you for the help!
mark
Pingback: US Submarine Production During WW2 | Math Encounters Blog