
Thursday, December 26, 2019

Social Security Administration Baby Name Database

Introduction

At this link, you can access the SSA's so-called Baby Name data, broken down both nationally and by state.  Each download is a zip file.  Unzipping the state archive gives you one text file per state, named for its 2-letter state code.  These files are in comma-separated values (CSV) format, and each line holds the state (which seems redundant to me), the sex, the year, the name and the number of births.  The national data is split into one file per year.  Each year's file, again in CSV format, holds the name, the sex and the number of births.  I downloaded these files and decided to write a program that graphs a given name, for a given locale, over the years.  In the program, I used Python's csv module to read each file as a sequence of lists, one list per line.  One catch is that the csv module returns every field as a string, so the numeric fields have to be converted, in my case to floats.
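To make the file layout and the string-to-number conversion concrete, here is a minimal sketch. The sample lines are made up for illustration (they are not real SSA figures), and the helper name read_state_rows is my own, not something from the program below.

```python
import csv
import io

# Made-up sample lines in the state-file layout: state, sex, year, name, births.
SAMPLE_STATE_DATA = "AK,M,1990,Martin,7\nAK,M,1991,Martin,5\nAK,F,1990,Mary,12\n"

def read_state_rows(text_file):
    """Parse state-format CSV lines, converting year and births to numbers."""
    rows = []
    for state, sex, year, name, births in csv.reader(text_file):
        # csv.reader yields every field as a string, so convert explicitly.
        rows.append([state, sex, int(year), name, float(births)])
    return rows

# io.StringIO stands in for an opened file so the sketch runs without any downloads.
rows = read_state_rows(io.StringIO(SAMPLE_STATE_DATA, newline=''))
print(rows[0])
```

With a real file you would pass the object returned by open(path, newline='') instead of the StringIO stand-in.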

The UI

The UI is built with Tkinter.  It consists of an Entry for the name, a pair of radio buttons for the sex, an OptionMenu for the state, a button to generate the graph and a handful of labels.
The name is free text, so it gets an Entry.  For the sex, I used radio buttons to reduce user entry errors.  Likewise, to make sure only valid locations are passed to the subroutines, I used an OptionMenu object (usually called a dropdown).

The Action

Everything starts when the user hits the Graph button, which calls the calc function.  The calc function figures out whether the user selected a national or a state-level request.  For a national request it calls the national function; for a state-level request it calls the bystate function.
Both the national and bystate functions read the appropriate files to build two lists holding the corresponding x and y values to be graphed.  Each then calls the graph function, which uses PyPlot API calls to draw and display the graph in a separate window.  Typical results are below and of course, there's no reason why I would have picked the name Martin for the examples.
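The list-building step can be tried on its own, without any files or a GUI. The sketch below assumes rows in the state-file layout (state, sex, year, name, births); the function name collect_series and the sample rows are hypothetical, made up for illustration.

```python
def collect_series(rows, the_name, sex):
    """Return (xdata, ydata) for rows laid out as state, sex, year, name, births."""
    xdata = []
    ydata = []
    for row in rows:
        # Keep only the rows that match both the requested name and sex.
        if row[3] == the_name and row[1] == sex:
            xdata.append(float(row[2]))   # year
            ydata.append(float(row[4]))   # number of births
    return xdata, ydata

# Made-up rows for illustration; real rows would come from csv.reader.
sample_rows = [
    ['AK', 'M', '1990', 'Martin', '7'],
    ['AK', 'F', '1990', 'Mary', '12'],
    ['AK', 'M', '1991', 'Martin', '5'],
]
years, births = collect_series(sample_rows, 'Martin', 'M')
print(years, births)
```

Returning the two lists instead of calling the graph function directly is just a choice for this sketch, since it makes the filtering easy to check by hand.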


The Code

## Code to process downloaded national name data - Using GUI

from tkinter import *
from tkinter import ttk
import matplotlib.pyplot as plt
import csv

filelocation = ""  ## INSERT STRING OF WHERE YOUR FILES ARE

## This function reads the information from the national files.
## Data is Name, M/F, number of births
def national(the_name,sex):
    xdata=[]
    ydata=[]
    for year in range(1880,2019):
        file = open(filelocation+"yob"+str(year)+'.txt', newline='')
        r = csv.reader(file)
        for row in r:
            if row[0]==the_name and row[1]==sex:
                xdata.append(float(year))
                ydata.append(float(row[2]))
        file.close()
    graph(xdata,ydata,the_name,sex,"USA")


## This function reads the information from the State files.
## Data is State,Sex (M/F), Year,Name, Number of births
def bystate(the_name,sex,state):
    xdata=[]
    ydata=[]
    file=open(filelocation+state+'.txt',newline='')
    r = csv.reader(file)
    for row in r:
        if row[3]==the_name and sex==row[1]:
            xdata.append(float(row[2]))
            ydata.append(float(row[4]))
    file.close()
    graph(xdata,ydata,the_name,sex,state)
            
## This part creates the graph using Matplotlib (plt)
def graph(xdata,ydata,the_name,sex,state):
    fig,ax = plt.subplots()
    line1, = ax.plot(xdata,ydata,label=the_name)
    ax.legend(loc='upper left')
    ax.set_title('Births per year in '+state+': '+the_name+' ('+sex+')')
    plt.ylabel('Number of births')
    plt.xlabel('Year')
    plt.show()

def calc():
    mf = ['M','F']
    if sel_state.get()=='USA':
        national(getn.get(),mf[sel_sex.get()])
    else:
        bystate(getn.get(),mf[sel_sex.get()],sel_state.get())
## Below is the data for the state/national selection dropdown
states=['USA','AK','AL','AR','AZ','CA','CO','CT','DC','DE','FL','GA',
        'HI','IA','ID','IL','IN','KS','KY','LA','MA','MD','ME',
        'MI','MN','MO','MS','MT','NC','ND','NE','NH','NJ','NM',
        'NV','NY','OH','OK','OR','PA','RI','SC','SD','TN','TX',
        'UT','VA','VT','WA','WI','WV','WY']
##Below is the set up for the GUI window
root = Tk()
content = ttk.Frame(root)
frame = ttk.Frame(content)
sel_state = StringVar()
sel_sex = IntVar()
lblinstr = ttk.Label(content, text="Enter Name and location")
getn= ttk.Entry(content, text="Name")
male= ttk.Radiobutton(content, text='Male', variable=sel_sex, value=0)
female=ttk.Radiobutton(content, text='Female',variable=sel_sex,value=1)
ok = ttk.Button(content, text="Graph", command=calc)
rlbl = ttk.Label(content, text="Enter Name")
slbl = ttk.Label(content, text="Enter Sex (M/F)")
stlbl = ttk.Label(content, text="Select USA or state")
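## ttk.OptionMenu takes the default value first and then the menu entries,
## so 'USA' is passed twice to keep it selectable in the dropdown
statedd = ttk.OptionMenu(content, sel_state, states[0], *states)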
##Below, we put the GUI together
content.grid(column = 0, row = 0)
frame.grid(column=0, row=0)
lblinstr.grid(column=0, row=0)
rlbl.grid(column=0, row=1)
getn.grid(column=1, row=1, sticky=N)
slbl.grid(column=0, row=2)
male.grid(column=1, row=2, sticky=N)
female.grid(column=2, row=2)
statedd.grid(column=1, row=3)
ok.grid(column=0, row=4)
## And run!!
root.mainloop()

Sunday, December 22, 2019

Creating graphs of Federal Reserve GDP data

After having read some interesting speculations about the future of work, AI, and robotics, I ran across a free resource in the form of the St. Louis Federal Reserve Bank's FRED system.  You do have to sign up with a valid email address, but there is no charge and the government is, well, not supposed to spam or sell your email address.
After signing up, I set up a download of some data: the total annualized quarterly GDP along with its goods and services components.  See the picture below, which is a screen shot of data from the FRED system.
The download was a text file with a header row, delimited by tabs.  This is not ideal, but it is something that the csv module in the Python standard library can handle.  The idea was to use matplotlib to graph the data.  I could have tried to create the graph on a Tkinter canvas using the draw functions.  But I am getting ahead of myself...
First we need to read the data and create lists for the x and y values of the two different lines.  I will run through the code I wrote.  We start by opening the text file:
file = open(filelocation, newline='')
r = csv.reader(file,delimiter='\t')
Then I will initialize variables:
period=[]
goods=[]
services=[]
header=True
The variable header is there so that we can skip over the header row during the loop which reads the data and creates the iterables that will be used to graph the data.  The Pyplot API works with NumPy arrays for the data to be graphed, but lists and most other iterables work too.
for row in r:
    if header:
        header=False
    else:
        date=row[0]
        serv=float(row[1])
        good=float(row[2])
        total=serv+good
        dateval=float(date[0:4])+float(date[5:7])/12
        period.append(dateval)
        goods.append(good/total*100)
        services.append(serv/total*100)
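
As a variant of the loop above, the header row can also be skipped by calling next() on the reader once before looping, which avoids the header flag. This is only a sketch: the sample text, its column names and its values are made up, the function name read_shares is my own, and I am assuming the same column order as in my loop (date, services, goods).

```python
import csv
import io

# Made-up tab-delimited sample; column names and values are placeholders.
SAMPLE = "DATE\tSERVICES\tGOODS\n2000-01-01\t30.0\t10.0\n2000-04-01\t10.0\t30.0\n"

def read_shares(text_file):
    """Return (period, goods, services) with goods/services as percent shares."""
    reader = csv.reader(text_file, delimiter='\t')
    next(reader)  # consume the header row so the loop only sees data rows
    period, goods, services = [], [], []
    for row in reader:
        date = row[0]
        serv = float(row[1])
        good = float(row[2])
        total = serv + good
        # Same date conversion as in the loop above: year plus month/12.
        period.append(float(date[0:4]) + float(date[5:7]) / 12)
        goods.append(good / total * 100)
        services.append(serv / total * 100)
    return period, goods, services

# io.StringIO stands in for the opened download file.
period, goods, services = read_shares(io.StringIO(SAMPLE, newline=''))
print(period, goods, services)
```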

At the start I had imported matplotlib.pyplot as plt, and the next part of the code creates the graph and shows it using rather simple matplotlib calls.  I did look up a lot of this in the examples and tutorials, but they were not very clear to me, so I struggled a bit to get it to work right.
fig,ax = plt.subplots()
line1, = ax.plot(period,goods,label='Goods')
line2, = ax.plot(period,services,label='Services')
ax.legend(loc='upper left')
ax.set_title('Change in GDP composition over time')
plt.ylabel('% of GDP')
plt.xlabel('Year')
plt.show()
Here's the final product:

What does this mean?

In this graph we see a long-term trend wherein the American economy has become more and more reliant on services instead of goods.  We don't make things any more, or maybe a better way to put it, we make things with such high efficiency and productivity that some one third of us can feed, house, clothe, and provide transport and other utilities for everyone.  So we have to find something for the other two thirds to do.  That turns out to be providing any and all kinds of services to the general population.  This trend has been around for a while, and it probably extends off the left of the graph, back to the turn of the century, although likely it flattens out.  During that time, many have treated it like the weather: something to complain about, but nothing can be done about it.  Even the current politicians are not going to be able to reverse this trend no matter how hard they try.  About the only possible strategy would be to do everything to limit or even reverse the growth in population.  However, this would have other significant adverse consequences, putting the economy into a depression.
One take-away is that if you are starting a new business, concentrate on services rather than goods.  If you have a goods-related business, then consider adding some kind of service provision to complement what you make.  Growth in Services is much more likely, if not easier.