
Thursday, December 26, 2019

Social Security Administration Baby Name Database

Introduction

At this link, you can access the SSA's so-called Baby Name data, broken down both nationally and by state.  Each download is a zip file.  When you unzip the state file, you get one text file per state, named for its 2-letter state code.  These files use a comma-separated values (CSV) format, and each line includes the state (which seems redundant to me, since the file name already gives it), the sex, the year, the name and the number of births.  The national data is split into one file per year.  Each year's file, again in CSV format, lists the name, the sex and the number of births.  I downloaded these files and decided to write a program that graphs a given name, for a given locale, over the years.  In the program, I used Python's csv module to read the file data, with each row coming back as a list.  One thing to note is that the csv module returns every field as a string by default, so you have to convert the numeric fields (here, to floats) before graphing them.
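To make the two layouts concrete, here is a minimal sketch that parses a few rows with the csv module. The sample rows and the `read_rows` helper are made up for illustration (they are not real SSA counts and not part of the program below); only the column order follows the description above.

```python
import csv
import io

# Made-up sample rows in the two layouts described above (not real SSA counts).
STATE_SAMPLE = "AK,M,1950,Martin,7\nAK,F,1950,Mary,12\n"   # state, sex, year, name, births
NATIONAL_SAMPLE = "Martin,M,1500\nMary,F,9000\n"            # name, sex, births

def read_rows(text):
    """Parse CSV text into a list of lists; every field comes back as a str."""
    return list(csv.reader(io.StringIO(text)))

state_rows = read_rows(STATE_SAMPLE)
national_rows = read_rows(NATIONAL_SAMPLE)

# Fields are strings, so numeric columns need an explicit conversion.
year = int(state_rows[0][2])
births = int(state_rows[0][4])
```

The counts are whole numbers, so `int` works here; the program below uses `float`, which is just as good for plotting.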

The UI

The UI is built with Tkinter.  It consists of an Entry for the name, a pair of radio buttons for the sex, an OptionMenu (usually called a dropdown) for the locale, a button to generate the graph and a handful of labels.
The name is free text, so it gets an Entry.  For the sex and the locale, I used radio buttons and an OptionMenu so that the user can only pick valid values, which keeps bad input out of the calls to the subroutines.
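As a rough sketch of the constrained-input idea, the snippet below builds just the radio buttons and the dropdown. The names `STATES`, `SEX_CODES`, `sex_code` and `build_ui` are hypothetical and the state list is shortened. Note that `ttk.OptionMenu` takes the default value as its third positional argument, and the values after it populate the menu.

```python
STATES = ["USA", "AK", "AL"]   # shortened list, for the sketch only
SEX_CODES = ["M", "F"]         # index matches the radio button values 0 and 1

def sex_code(index):
    """Translate the radio button value (0 or 1) into the code used in the files."""
    return SEX_CODES[index]

def build_ui():
    """Build a tiny window with the two constrained inputs (needs a display)."""
    import tkinter as tk
    from tkinter import ttk

    root = tk.Tk()
    sel_state = tk.StringVar()
    sel_sex = tk.IntVar(value=0)
    ttk.Radiobutton(root, text="Male", variable=sel_sex, value=0).grid(column=0, row=0)
    ttk.Radiobutton(root, text="Female", variable=sel_sex, value=1).grid(column=1, row=0)
    # Third argument is the default; the remaining arguments are the menu entries.
    ttk.OptionMenu(root, sel_state, STATES[0], *STATES).grid(column=0, row=1)
    return root, sel_state, sel_sex
```

To see the window, call `build_ui()` and then `mainloop()` on the returned root. Since neither widget accepts free text, `sel_sex.get()` can only be 0 or 1 and `sel_state.get()` can only be one of the listed codes.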

The Action

Everything starts when the user hits the Graph button, which calls the calc function.  The calc function figures out whether the user selected a national or a state-level request.  If it is national, the code calls the national function; if it is state level, it calls the bystate function.
Both the national and bystate functions read the appropriate files to build two lists holding the corresponding x (year) and y (number of births) values to be graphed.  Each then calls the graph function, which uses PyPlot API calls to draw and display the graph in a separate window.  Typical results are below and, of course, there's no reason why I would have picked the name Martin for the examples.
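Before the full listing, the flow described in this section can be sketched on in-memory data. This is only an illustration: `collect_series` and `pick_reader` are hypothetical helpers, the rows are made up, and unlike the real functions the sketch returns the two lists instead of passing them to graph.

```python
import csv
import io

# Made-up rows in the state layout: state, sex, year, name, births.
SAMPLE = (
    "AK,M,1950,Martin,7\n"
    "AK,F,1950,Mary,12\n"
    "AK,M,1951,Martin,9\n"
)

def collect_series(rows, the_name, sex):
    """Build parallel x (year) and y (births) lists for one name and sex."""
    xdata, ydata = [], []
    for _state, row_sex, year, name, births in rows:
        if name == the_name and row_sex == sex:
            xdata.append(float(year))
            ydata.append(float(births))
    return xdata, ydata

def pick_reader(locale):
    """Mirror calc's decision: 'USA' means national, anything else is a state."""
    return "national" if locale == "USA" else "bystate"

rows = csv.reader(io.StringIO(SAMPLE))
xdata, ydata = collect_series(rows, "Martin", "M")
```

Returning the lists rather than plotting them makes the filtering step easy to check on its own; the resulting `xdata` and `ydata` are what would be handed to the plotting call.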


The Code

## Code to process downloaded national name data - Using GUI

from tkinter import *
from tkinter import ttk
import matplotlib.pyplot as plt
import csv

filelocation = ""  ## INSERT STRING OF WHERE YOUR FILES ARE (include the trailing slash)

## This function reads the information from the national files.
## Data is Name, M/F, number of births
def national(the_name,sex):
    xdata=[]
    ydata=[]
    for year in range(1880,2019):
        ## The with block closes the file even if reading fails part way
        with open(filelocation+"yob"+str(year)+'.txt', newline='') as file:
            for row in csv.reader(file):
                if row[0]==the_name and row[1]==sex:
                    xdata.append(float(year))
                    ydata.append(float(row[2]))
    graph(xdata,ydata,the_name,sex,"USA")


## This function reads the information from the State files.
## Data is State,Sex (M/F), Year,Name, Number of births
def bystate(the_name,sex,state):
    xdata=[]
    ydata=[]
    ## The with block closes the file even if reading fails part way
    with open(filelocation+state+'.txt',newline='') as file:
        for row in csv.reader(file):
            if row[3]==the_name and sex==row[1]:
                xdata.append(float(row[2]))
                ydata.append(float(row[4]))
    graph(xdata,ydata,the_name,sex,state)
            
## This part creates the graph using Matplotlib (plt)
def graph(xdata,ydata,the_name,sex,state):
    fig,ax = plt.subplots()
    line1, = ax.plot(xdata,ydata,label=the_name)
    ax.legend(loc='upper left')
    ax.set_title('Births per year in '+state+': '+the_name+' ('+sex+')')
    plt.ylabel('Number of births')
    plt.xlabel('Year')
    plt.show()

def calc():
    mf = ['M','F']
    if sel_state.get()=='USA':
        national(getn.get(),mf[sel_sex.get()])
    else:
        bystate(getn.get(),mf[sel_sex.get()],sel_state.get())
## Below is the data for the state/national selection dropdown
states=['USA','AK','AL','AR','AZ','CA','CO','CT','DC','DE','FL','GA',
        'HI','IA','ID','IL','IN','KS','KY','LA','MA','MD','ME',
        'MI','MN','MO','MS','MT','NC','ND','NE','NH','NJ','NM',
        'NV','NY','OH','OK','OR','PA','RI','SC','SD','TN','TX',
        'UT','VA','VT','WA','WI','WV','WY']
##Below is the set up for the GUI window
root = Tk()
content = ttk.Frame(root)
frame = ttk.Frame(content)
sel_state = StringVar()
sel_sex = IntVar()
lblinstr = ttk.Label(content, text="Enter Name and location")
getn= ttk.Entry(content, text="Name")
male= ttk.Radiobutton(content, text='Male', variable=sel_sex, value=0)
female=ttk.Radiobutton(content, text='Female',variable=sel_sex,value=1)
ok = ttk.Button(content, text="Graph", command=calc)
rlbl = ttk.Label(content, text="Enter Name")
slbl = ttk.Label(content, text="Enter Sex (M/F)")
stlbl = ttk.Label(content, text="Select USA or state")
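## ttk.OptionMenu takes the default value as its third argument, so 'USA' is
## passed explicitly; otherwise it becomes the default but is left out of the menu
statedd = ttk.OptionMenu(content, sel_state, states[0], *states)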
##Below, we put the GUI together
content.grid(column = 0, row = 0)
frame.grid(column=0, row=0)
lblinstr.grid(column=0, row=0)
rlbl.grid(column=0, row=1)
getn.grid(column=1, row=1, sticky=N)
slbl.grid(column=0, row=2)
male.grid(column=1, row=2, sticky=N)
female.grid(column=2, row=2)
statedd.grid(column=1, row=3)
ok.grid(column=0, row=4)
## And run!!
root.mainloop()
