Categories
Uncategorized

Comparing US baby names

Here I analyse the US Social Security Baby Name catalogue, which reports the names given to male and female newborns for every year since 1880.

First of all, I download the data sets from the US Social Security website and save them in a new folder, “Names”. Running ls Names/ lists all the contents of that folder.
It is often easier to see the data when it’s laid out in a pandas data frame, or table, like this. I load all the baby names recorded in the year 2007 (the file name is ‘yob2007’). Then I create another column, for the year.
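As a minimal sketch of this step: the yearly files have no header row, so I pass the column names myself. The column names ("name", "sex", "number"), the path in the comment, and the sample rows are assumptions for illustration; the counts are made up.

```python
import io

import pandas as pd

# Stand-in for the contents of the 2007 file (made-up counts, no header row).
sample_2007 = io.StringIO("Emily,F,300\nIsabella,F,280\nJacob,M,320\n")

# On disk this would be something like pd.read_csv("Names/yob2007.txt", ...);
# that path is an assumption.
names2007 = pd.read_csv(sample_2007, names=["name", "sex", "number"])

# Add a column recording which year these rows belong to.
names2007["year"] = 2007

print(names2007)
```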
Now I create a variable “allyears” so that I can manipulate all the data much more easily.

I need to load all the tables and concatenate them into a single data frame. To avoid confusing data from different years, we can prepare the individual data frames by adding a new column that specifies the year. To do this on the fly, directly on the output of read_csv, we can chain the DataFrame.assign method.

We managed to load one file in a one-liner, so I can use a comprehension to load and concatenate all the data frames.

This piece of code does several things. We loop over all the years between 1880 and 2018. We build up the file name using an f-string, and feed that into read_csv. We specify the column names, and we add the column that gives the correct year from the loop variable. Finally, we pass all the resulting data frames to pd.concat (pandas concat).
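The steps just described can be sketched roughly like this. The function name load_all_years, the "yobYYYY.txt" file pattern and the column names are assumptions; to keep the example runnable anywhere, it writes two tiny made-up files to a temporary folder instead of reading the real "Names" folder.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_all_years(folder, first_year, last_year):
    """Read one file per year, tag each with its year, and stack them."""
    return pd.concat(
        [
            pd.read_csv(
                Path(folder) / f"yob{year}.txt",  # assumed file name pattern
                names=["name", "sex", "number"],
            ).assign(year=year)  # add the year column on the fly
            for year in range(first_year, last_year + 1)
        ]
    )


# Demo with made-up data; the real call would be load_all_years("Names", 1880, 2018).
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "yob1880.txt").write_text("Mary,F,70\nJohn,M,90\n")
    Path(tmp, "yob1881.txt").write_text("Mary,F,69\nJohn,M,87\n")
    allyears = load_all_years(tmp, 1880, 1881)

print(allyears)
```

Note that range excludes its upper bound, so the sketch adds 1 to include the last year.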

Here is what the data frame looks like.
In the top graph, we can see the popularity of the name (for boys) ‘Alex’ over time, from 1880 to 2020. Notice how Matplotlib automatically uses the index to set the x-axis. It probably also makes sense to consider the frequency of a name as a fraction of the number of babies born in a year. So in the bottom graph, I measure the proportion of babies born in any given year who are named Mary. To get that, we can apply groupby to the un-indexed data frame and take the sum. Then we can normalize the Mary counts by the total number of newborns in every year. So as a percentage of all babies, Mary was actually more popular at the beginning of the 20th century. But there were altogether more Marys born in the 1920s and 50s.
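Here is a small sketch of the normalization, assuming the combined data frame has the columns name, sex, number and year. The rows are made up, chosen so that the raw count goes up while the share goes down.

```python
import pandas as pd

# Made-up stand-in for the combined data frame.
allyears = pd.DataFrame(
    {
        "name": ["Mary", "John", "Mary", "John"],
        "sex": ["F", "M", "F", "M"],
        "number": [50, 50, 60, 140],
        "year": [1900, 1900, 1950, 1950],
    }
)

# Total number of recorded babies per year.
totals = allyears.groupby("year")["number"].sum()

# Yearly counts for one name, indexed by year so they align with the totals.
mary = allyears[(allyears["name"] == "Mary") & (allyears["sex"] == "F")]
mary = mary.set_index("year")["number"]

# Fraction of all babies in each year who were named Mary.
share = mary / totals

print(share)
```

Calling share.plot() would then draw the proportion over time, with the year index on the x-axis.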
Here I compare the popularity of multiple names. In the bottom graph, where I look at female names, my own name (Georgina) seems strikingly unpopular over time – the red line grazes the bottom!

What if I want to look at the variants of the same name, like Claire?

Here I look at the results for the names Claire, Clare, Clara, Chiara and Ciara. There are two spellings of Claire (Claire and Clare), an older version, Clara, and Italian and Irish spellings, Chiara and Ciara. Here’s the plot. Notice how Matplotlib tries to put the legend out of the way. Claire is now dominant, but Clara is having a resurgence after having been the dominant variant at the beginning of the 20th century. We can also make a slightly different cumulative or stacked plot that adds up the counts on top of each other (see the next graph).
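One way to prepare the data for both plots is a table with one row per year and one column per variant. This is a sketch under assumptions: the column names, the made-up counts and the helper plot_variants are mine, and the plotting part needs Matplotlib installed.

```python
import pandas as pd

# Made-up stand-in for the combined data frame.
allyears = pd.DataFrame(
    {
        "name": ["Claire", "Clara", "Claire", "Clara", "Ciara"],
        "sex": ["F"] * 5,
        "number": [10, 40, 30, 20, 5],
        "year": [1900, 1900, 2000, 2000, 2000],
    }
)
variants = ["Claire", "Clare", "Clara", "Chiara", "Ciara"]

# Keep only the girls' rows for the chosen spellings.
girls = allyears[(allyears["sex"] == "F") & allyears["name"].isin(variants)]

# One row per year, one column per variant; missing combinations become 0.
table = girls.pivot_table(
    index="year", columns="name", values="number", aggfunc="sum", fill_value=0
)


def plot_variants(table):
    """Draw a line plot and a stacked plot of the same table."""
    import matplotlib.pyplot as plt

    fig, (ax_lines, ax_stack) = plt.subplots(2, 1, sharex=True)
    table.plot(ax=ax_lines)  # one line per variant
    ax_stack.stackplot(table.index, table.T.values, labels=list(table.columns))
    ax_stack.legend()
    return fig


print(table)
```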
Here, I’m selecting all the boys’ names given to babies in the year 2018, because I want to find out which were the ten most popular boys’ names in that year.
The values in this data frame are sorted, with Liam being the most popular boys’ name in 2018.
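A minimal sketch of this selection, assuming the same column names as before. The counts are made up and only keep Liam in first place, as in the real table.

```python
import pandas as pd

# Made-up stand-in for the combined data frame.
allyears = pd.DataFrame(
    {
        "name": ["Noah", "Liam", "Emma", "William", "Liam"],
        "sex": ["M", "M", "F", "M", "M"],
        "number": [18, 20, 19, 14, 17],
        "year": [2018, 2018, 2018, 2018, 2017],
    }
)

# Boolean mask: boys only, 2018 only.
boys2018 = allyears[(allyears["sex"] == "M") & (allyears["year"] == 2018)]

# Sort from most to least common and keep the first ten rows.
top_boys2018 = boys2018.sort_values("number", ascending=False).head(10)

print(top_boys2018)
```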

Yearly top ten names: tracking the popularity of a name across years

Here I select the data for a given index value (sex, year), such as male and 2018. To select all records for a given index value, we use .loc followed by brackets, not parentheses. Because the index has two levels, this is a multi-index .loc, and the value is a tuple. Chaining pandas methods lets us sort the values, take the top ten, and get rid of the index. If we are to build a table of the top ten names over multiple years, we should get rid of the index with reset_index and select the name column only.
This is the equivalent for girls’ names.
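The chain can be sketched as follows. The helper top_names, the column names and the counts are assumptions for illustration; the sketch takes three names per year rather than ten to keep the sample small.

```python
import pandas as pd

# Made-up stand-in for the combined data frame.
allyears = pd.DataFrame(
    {
        "name": ["Liam", "Noah", "William", "Liam", "Noah", "William",
                 "Emma", "Olivia", "Ava", "Emma", "Olivia", "Ava"],
        "sex": ["M"] * 6 + ["F"] * 6,
        "number": [18, 19, 15, 20, 18, 14, 20, 19, 16, 19, 18, 15],
        "year": [2017] * 3 + [2018] * 3 + [2017] * 3 + [2018] * 3,
    }
)

# Two-level index: sex first, then year. Sorting makes lookups efficient.
indexed = allyears.set_index(["sex", "year"]).sort_index()


def top_names(sex, year, n=10):
    """Return the n most common names for one sex and year, as a plain Series."""
    return (
        indexed.loc[(sex, year)]  # brackets, with a tuple for the multi-index
        .sort_values("number", ascending=False)
        .head(n)
        .reset_index(drop=True)["name"]  # drop the index, keep only the names
    )


# One column per year, so the rankings can be compared side by side.
top_boys = pd.DataFrame({year: top_names("M", year, n=3) for year in (2017, 2018)})

print(top_boys)
print(top_names("F", 2018, n=3))
```

Because every column shares the same 0, 1, 2, ... index after reset_index, the rows of the table line up by rank.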

Plotting a graph to analyse the change in popularity over time

2018 top ten.
The top graph looks at the popularity of girls’ baby names across the entire database of records. We can see that Evelyn peaked in the 1920s. As for the boys, Liam is at the top in recent years. William and James are classic favourites.

All-time favourite baby names

We select females, group by name, sum the counts, sort the values, and then take the top ten. If we look at the popularity of these names over time, we see that they gained their spots in the first half of the 20th century, except for Jennifer. Note that, given the structure of the “all-time f” data frame, I’m looping over the index rather than the values.
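A rough sketch of the all-time ranking, with made-up counts. I call the result alltime_f here; that exact spelling, like the column names, is an assumption. The result is a Series whose index holds the names and whose values hold the totals, which is why the loop goes over the index.

```python
import pandas as pd

# Made-up stand-in for the combined data frame.
allyears = pd.DataFrame(
    {
        "name": ["Mary", "John", "Mary", "Linda", "Jennifer",
                 "Mary", "Linda", "Jennifer"],
        "sex": ["F", "M", "F", "F", "F", "F", "F", "F"],
        "number": [100, 500, 80, 90, 10, 10, 5, 60],
        "year": [1900, 1900, 1950, 1950, 1950, 2000, 2000, 2000],
    }
)

females = allyears[allyears["sex"] == "F"]

# Total count per name over all years, largest first, top ten only.
alltime_f = (
    females.groupby("name")["number"].sum().sort_values(ascending=False).head(10)
)

# The names live in the index, so loop over that to get each yearly series.
by_year = {}
for name in alltime_f.index:
    rows = females[females["name"] == name]
    by_year[name] = rows.set_index("year")["number"]

print(alltime_f)
```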

Top ten unisex baby names

We’ll load our data set as usual. We need to compute the total number of boys and girls for a given name. This seems a good place to use groupby, which lets us segment the data before applying an aggregation, in this case the sum of the number of babies. So we use groupby over sex and name, we select the number column and we take the sum. From this series with a multi-index, we can grab the males and females respectively, using .loc. As you see, the two indices are going to be different. Nevertheless, we can combine the two series and pandas will align the indices for us. The result will be NaN wherever either series doesn’t have an element. For instance, we can check where the ratio between males and females is less than two. We can get rid of those NaNs with dropna. Now, remember the definition of unisex names as those with a ratio between 0.5 and 2. This is a good expression for fancy indexing, and after we apply it, we see that 1660 names pass the test. Here, I’ve taken the index, because we don’t actually need the ratio itself, but just the names.
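The whole procedure can be sketched like this, on made-up counts and with assumed column names. The last step, ranking the unisex names by their combined total, is my guess at how to get from the list of names to a top ten.

```python
import pandas as pd

# Made-up stand-in for the combined data frame.
allyears = pd.DataFrame(
    {
        "name": ["Jordan", "Jordan", "Mary", "Mary", "John", "Alex", "Alex"],
        "sex": ["M", "F", "F", "M", "M", "M", "F"],
        "number": [60, 40, 100, 1, 100, 90, 30],
        "year": [2000] * 7,
    }
)

# Total babies per (sex, name) pair across all years.
totals = allyears.groupby(["sex", "name"])["number"].sum()

# Split the multi-indexed series into one series per sex, indexed by name.
male, female = totals.loc["M"], totals.loc["F"]

# pandas aligns on the name index; names missing on one side give NaN.
ratio = (male / female).dropna()

# Fancy indexing: keep names where neither sex outnumbers the other 2 to 1.
unisex = ratio[(ratio > 0.5) & (ratio < 2)].index

# Assumed final step: rank the unisex names by their combined total.
combined = (male.loc[unisex] + female.loc[unisex]).sort_values(ascending=False)
top_unisex = combined.head(10)

print(top_unisex)
```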


Python: parsing web content for a recipe search

The Spoonacular API is a food and recipe API that lets you query an online database with hundreds of thousands of recipes, products and ingredients. Using Python, I wrote a script that takes user input (what ingredients do you have? What are your dietary requirements?), sends a request to the API, and then returns a tailored list of recipes. The list also provides the user with calorie information about the different dishes. There are many other endpoints too, including the option to receive a random food joke, wine recommendations and recipes based on your carbohydrate limits.

Choosing a recipe

In my script, I ask users: what’s in your store cupboard? You can add ingredients like egg or tomato, and separate each one with a comma.

In this example, I add “spinach, egg”, which I type into the Python console at the bottom of the screen.
I ask the user for their dietary requirements.
If the user types in “yes”, they receive a second question: “What is your dietary requirement?” If the user says “no” or “I don’t know”, they receive a list based on their ingredients alone (the values of the above input).
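The input handling for these first questions can be sketched as two small helpers. The function names parse_ingredients and wants_diet are hypothetical, and in a real script the strings would come from input() rather than being written out as they are here.

```python
def parse_ingredients(raw):
    """Split a comma-separated answer into a clean list of ingredients."""
    return [item.strip().lower() for item in raw.split(",") if item.strip()]


def wants_diet(answer):
    """Treat only an explicit yes as a request for the follow-up question."""
    return answer.strip().lower() in {"yes", "y"}


# In the script these would be answers typed into the console via input().
ingredients = parse_ingredients("spinach, egg")
ask_for_diet = wants_diet("yes")

print(ingredients, ask_for_diet)
```

Anything other than a yes, including "no" and "I don't know", falls through to the ingredients-only search.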
In this example, I stipulate a Vegetarian-friendly list of recipes. Each one is listed along with its calorie count per serving.
After this, a question pops up, asking the user which recipe they like the look of.
In this example, I choose “Baked Eggs with Spinach and Tomatoes”. The program then returns the list of ingredients I need.
Finally, the user is asked if they would like to continue to the recipe steps for that recipe.
And voilà! The program returns the recipe steps, along with the average number of servings it creates.
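As a hedged sketch of the request behind this search: the base URL, the recipes/complexSearch endpoint and the parameter names (includeIngredients, diet, addRecipeNutrition, apiKey) are my assumptions about the Spoonacular API and should be checked against its documentation. The helper names and the "YOUR_API_KEY" placeholder are hypothetical. Only the URL is built when the example runs; the function that contacts the API is defined but not called.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "https://api.spoonacular.com"  # assumed base URL


def build_search_url(ingredients, diet=None, api_key="YOUR_API_KEY"):
    """Assemble the search URL from the user's answers (no network access)."""
    params = {
        "includeIngredients": ",".join(ingredients),  # assumed parameter name
        "addRecipeNutrition": "true",  # assumed flag that adds calorie data
        "apiKey": api_key,
    }
    if diet:
        params["diet"] = diet  # only sent when the user gave a requirement
    return f"{BASE_URL}/recipes/complexSearch?{urlencode(params)}"


def search_recipes(ingredients, diet=None, api_key="YOUR_API_KEY"):
    """Send the request and decode the JSON reply (needs a real API key)."""
    with urlopen(build_search_url(ingredients, diet, api_key)) as response:
        return json.load(response)


url = build_search_url(["spinach", "egg"], diet="vegetarian")
print(url)
```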

Carb counting

I also wrote a program which returns recipes based on maximum and minimum values for carbohydrate intake. It also writes the ingredients to an external file (you can see it on the left), called “recipes_carb_limit_txt”. The results you get here in the Python console look a little different to the ones above, because the response is in a dictionary format. A dictionary is a comma-separated list of key/value pairs inside curly brackets: {}
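To illustrate the dictionary format and the file writing, here is a small sketch. The recipe dictionary, its keys and its carbohydrate value are made up for the example and are not the API's actual response; the helper within_carb_limits and the file name carb_recipes_demo.txt are hypothetical, and the file goes to a temporary folder.

```python
import tempfile
from pathlib import Path

# Made-up recipes in dictionary form: key/value pairs inside curly brackets.
recipes = [
    {
        "title": "Baked Eggs with Spinach and Tomatoes",
        "carbs": "12g",  # invented value, for illustration only
        "ingredients": ["egg", "spinach", "tomato"],
    },
    {"title": "Pasta Bake", "carbs": "70g", "ingredients": ["pasta", "cheese"]},
]


def within_carb_limits(recipe, min_carbs, max_carbs):
    """Check a recipe's carbohydrate value against the user's limits."""
    grams = float(recipe["carbs"].rstrip("g"))
    return min_carbs <= grams <= max_carbs


chosen = [r for r in recipes if within_carb_limits(r, min_carbs=5, max_carbs=30)]

# Write each matching recipe and its ingredients to a text file.
out_path = Path(tempfile.gettempdir()) / "carb_recipes_demo.txt"
with open(out_path, "w", encoding="utf-8") as f:
    for recipe in chosen:
        f.write(f"{recipe['title']}: {', '.join(recipe['ingredients'])}\n")

print(out_path.read_text(encoding="utf-8"))
```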

Random Food Joke

This simple program returns a random (usually punny) joke.

Documentation


Turning property fund data into soup

The Python programming language can scrape information from a web page (an HTML page). Here, I scrape the table element from the performance data page provided by the Association of Real Estate Funds. This index shows the performance of property funds on a quarterly basis; here I look at the quarter ending March 2020. Both the Index and the Property Fund Vision enable investors and their advisers to compare fund performance and other relevant data with appropriate alternative funds, either individually or at an aggregated level. I turn the table data into soup in Python, meaning that Python can read the structure and then write it into a CSV file.
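A minimal sketch of the soup step, assuming the Beautiful Soup library (bs4) is installed. The HTML string below is a made-up stand-in for the real performance page, with invented fund names and figures, and the CSV is written to an in-memory buffer rather than a file.

```python
import csv
import io

from bs4 import BeautifulSoup

# Made-up stand-in for the downloaded page; the real one would be fetched first.
html = """
<table>
  <tr><th>Fund</th><th>Quarterly return (%)</th></tr>
  <tr><td>Example Fund A</td><td>-1.2</td></tr>
  <tr><td>Example Fund B</td><td>0.4</td></tr>
</table>
"""

# Parse the page so that its tag structure can be searched.
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")

# One list per table row, holding the text of each header or data cell.
rows = [
    [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    for tr in table.find_all("tr")
]

# Write the rows out as CSV; swap the buffer for open("funds.csv", "w", newline="")
# to save to disk (that file name is hypothetical).
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)

print(buffer.getvalue())
```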