Here I analyse the US Social Security Baby Name catalogue, which reports the names given to male and female newborns for every year since 1880.
I need to load all the tables and concatenate them into a single data frame. To avoid confusing data from different years, we can prepare each individual data frame by adding a new column that specifies the year. To do this on the fly, by chaining a method directly onto the output of read_csv, we can use the data frame method assign.
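As a minimal sketch of that chaining step, the snippet below reads one table and tags it with its year. The column names (name, sex, number), the year 2011, and the in-memory sample rows are assumptions made for illustration; the counts are invented, not real figures, and a real run would pass the path of a yearly file instead of the StringIO object.

```python
import io

import pandas as pd

# Stand-in for a single yearly file. The rows are invented for illustration;
# the layout (no header row, then name,sex,number) is an assumption.
sample_file = io.StringIO("Emma,F,100\nOlivia,F,90\nLiam,M,95\n")

# read_csv returns a data frame, so assign can be chained directly onto it.
# assign returns a new data frame with the extra 'year' column added.
names_one_year = (
    pd.read_csv(sample_file, names=["name", "sex", "number"])
      .assign(year=2011)
)

print(names_one_year)
```

Because assign returns a new data frame rather than modifying one in place, the whole load fits in a single expression, which is what makes it usable inside a comprehension.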
We managed to load one file in a one-liner, so I’m going to use a list comprehension to load all the files and concatenate the resulting data frames.
This piece of code does several things. We loop over all the years between 1880 and 2018. We build up the file name using an f-string, and feed that into read_csv. We specify the column names, and we add the column that gives the correct year from the loop variable. Finally, we pass all the resulting data frames to pd.concat.
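The steps just described might look roughly like the sketch below. The file naming pattern (yob followed by the year and .txt), the column names, the helper name load_years, and the use of ignore_index=True are all assumptions, not taken from the original code. So that it runs without the real data, the demo writes three tiny made-up files to a temporary folder; with the actual catalogue you would call the helper with 1880 and 2018.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_years(folder, first_year, last_year):
    """Load one file per year and concatenate them into a single data frame.

    Hypothetical helper: the 'yob{year}.txt' naming is an assumption.
    """
    return pd.concat(
        [
            pd.read_csv(Path(folder) / f"yob{year}.txt",
                        names=["name", "sex", "number"])
              .assign(year=year)  # tag every row with the loop variable
            for year in range(first_year, last_year + 1)  # last year included
        ],
        ignore_index=True,  # build a fresh 0..n-1 index for the combined frame
    )


# Demo on invented data: three small files in a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    for year in (1880, 1881, 1882):
        (Path(tmp) / f"yob{year}.txt").write_text("Mary,F,10\nJohn,M,9\n")
    allyears = load_years(tmp, 1880, 1882)

print(allyears)
```

Note that range excludes its upper bound, so covering 1880 through 2018 requires range(1880, 2019); the helper handles that with last_year + 1.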
What if I want to look at the variants of the same name, like Claire?
Yearly top ten names: tracking the popularity of a name across years
Plotting a graph to analyse the change in popularity over time
All-time favourite baby names
Top ten unisex baby names
We’ll load our data set as usual. We need to compute the total number of boys and girls for a given name. This seems a good place to use groupby, which lets us segment the data before applying an aggregation, in this case the sum of the number of babies. So we use groupby over sex and name, we select the number column, and we take the sum. From the resulting series with a multi-index, we can grab the males and females respectively, using .loc. As you see, the two indices are going to be different. Nevertheless, we can combine the two series and pandas will align the indices for us.
The result is NaN wherever either series doesn’t have an element. For instance, we can check where the ratio between males and females is less than two. We can get rid of those NaNs with dropna. Now, remember the definition of unisex names as those with a ratio between 0.5 and 2. This is a good expression for fancy indexing, and after we apply it, we see that 1660 names pass the test. Here, I’ve taken the index, because we don’t actually need the ratio itself, just the names.
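A compact sketch of that pipeline is shown below, on a small invented data frame rather than the real catalogue (so it will not reproduce the 1660 figure). The variable names and the column names are assumptions for illustration.

```python
import pandas as pd

# Invented stand-in for the concatenated data set.
allyears = pd.DataFrame({
    "name":   ["Alex", "Alex", "Mary", "John", "Jordan", "Jordan", "Mary"],
    "sex":    ["M",    "F",    "F",    "M",    "M",      "F",      "M"],
    "number": [30,     20,     100,    90,     50,       40,       1],
    "year":   [2000] * 7,
})

# Total babies per (sex, name): a series with a two-level index.
totals = allyears.groupby(["sex", "name"])["number"].sum()

# Selecting on the first index level leaves series indexed by name only.
male = totals.loc["M"]
female = totals.loc["F"]

# Division aligns on name; names missing from either side yield NaN,
# which dropna then removes.
ratios = (male / female).dropna()

# Boolean mask for ratios strictly between 0.5 and 2; keep only the names.
unisex = ratios[(ratios > 0.5) & (ratios < 2)].index

print(list(unisex))
```

In this toy data John appears only among the boys, so his ratio is NaN and he is dropped, while Mary has a ratio far below 0.5 and fails the mask.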