Scraping Day Care Data in the Naked City

This post was published on the now-closed HuffPost Contributor platform.

Data's a funny thing. On the surface it appears fairly banal: warehouses of tabulated information, built from rows and columns of numbers and text that represent a series of events or things sharing a common type. But not all data is created equal, nor, as you will read, is it easily harvested. Collecting and disseminating civic data is more often than not a labor of love born out of necessity.

As it happened one day, Anita Schmid, a physicist turned data scientist and mother of a small child, motivated by the lack of useful and timely day care information, sent an email to the BetaNYC dev list asking how one would go about acquiring day care data from the city. Within hours several BetaNYCers began assisting. One of the group's organizers, Chris Whong, wrote a Ruby scraper that retrieved the data from the NYC Department of Health and Mental Hygiene (DoHMH) web site. Others worked on summarizing the information for each location, since some day care centers have multiple permits. To finally generate a useful map, all the addresses had to be geocoded. This was also swiftly accomplished by members of the BetaNYC community. Many others contributed insights and support. Two weeks and one free CartoDB account later, the map was online.

Day Care Map concentrates on center-based child care for children up to 5 years old. These are facilities that require personnel with training in early childhood education or related studies. In the City of New York, center-based group child care is regulated by NYC Health Code Article 47, and the centers are licensed by the DoHMH. The DoHMH does provide an online directory of all centers with permits, including a history of inspections. However, the only filter option for searching a particular neighborhood is by zip code, and it returns a sometimes long list of hyperlinked facility names. A parent then has to click on each link to find more information, such as what ages are covered and the facility's capacity. Neither intuitive nor user-friendly for a large urban landscape like New York, where a block or street can be the determining factor in selecting a child's future day care.

Day Care Map attempts to capture all licensed day care center data from New York City and State government agency web sites using a method called "data scraping." Because scraping can be slow and can strain the web servers it targets, Anita leveraged her data mining skills toward developing a faster Python-based scraper. Making use of open-source Python libraries, the script assigns a base URL and issues a GET request to begin parsing HTML for relevant data. This is the logic brutally summarized:
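
As an illustration only, and not Anita's actual script, the flow described above (set a base URL, issue a GET request, pull text out of elements by their HTML class) might be sketched like this. It uses only the Python standard library rather than whichever open-source libraries the real scraper relied on. The URL, the "facility-name" class and the sample markup are all made-up placeholders, and the network call is left commented out so the sketch runs offline.

```python
from html.parser import HTMLParser
from urllib.request import urlopen

# Hypothetical placeholder, NOT the real DoHMH directory address.
BASE_URL = "https://example.org/childcare/search"

# Tags that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


def fetch(url):
    """Issue a GET request and return the response body as text."""
    with urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


class ClassTextParser(HTMLParser):
    """Collect the text of every element that carries a given class."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.depth = 0      # greater than 0 while inside a matching element
        self.results = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth:
            self.depth += 1             # nested tag inside a match
        elif self.target_class in classes:
            self.depth = 1              # entering a new match
            self.results.append("")

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.results[-1] += data


def extract_by_class(html, target_class):
    """Return the whitespace-normalized text of each matching element."""
    parser = ClassTextParser(target_class)
    parser.feed(html)
    parser.close()
    return [" ".join(text.split()) for text in parser.results]


# Made-up sample markup standing in for one page of search results.
SAMPLE_HTML = """
<ul>
  <li><a class="facility-name" href="/center/1">Sunny Days Child Care</a></li>
  <li><a class="facility-name" href="/center/2">Little   Stars <b>Preschool</b></a></li>
</ul>
"""

# With a real site this would be: html = fetch(BASE_URL)
names = extract_by_class(SAMPLE_HTML, "facility-name")
```

A real scraper would loop over many result pages and pause between requests to go easy on the server. This sketch also assumes reasonably well-formed markup.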


After several 'for' loops parse the responses by targeting HTML class selectors, the data is written and saved to three .csv (comma-delimited) text files. Depending on the amount of content being "scraped" and the web site's server and network resources, the process can take anywhere from minutes to hours. The saved data then undergoes a round of geocoding, which assigns lat/lon coordinate columns to each day care center. Without geo coordinates we would be monkishly back to drawing points by hand. Time better spent brewing mead and making cheese.
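
To make the save-then-geocode step concrete, here is a minimal sketch, assuming each scraped record is a dictionary with an "address" field. The fake_geocode function is a hypothetical stand-in for whatever geocoding service was actually used, and the facility names, addresses and coordinates are invented for illustration. It writes one CSV to an in-memory buffer, whereas the real script saved three files to disk.

```python
import csv
import io


def fake_geocode(address):
    """Hypothetical stand-in for a geocoding service.

    Returns a (lat, lon) tuple, or None when the address is unknown.
    The coordinates below are illustrative values only.
    """
    lookup = {"100 Example St, Brooklyn, NY": (40.6782, -73.9442)}
    return lookup.get(address)


def add_coordinates(rows, geocode):
    """Return copies of the rows with 'lat' and 'lon' columns added."""
    out = []
    for row in rows:
        coords = geocode(row["address"])
        lat, lon = coords if coords else ("", "")  # leave blanks on a miss
        out.append({**row, "lat": lat, "lon": lon})
    return out


def write_csv(rows, handle):
    """Write the rows as comma-delimited text with a header line."""
    writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)


# Invented records standing in for scraped day care centers.
rows = [
    {"name": "Sunny Days Child Care", "address": "100 Example St, Brooklyn, NY"},
    {"name": "Little Stars Preschool", "address": "1 Unknown Rd, Queens, NY"},
]

geocoded = add_coordinates(rows, fake_geocode)

buffer = io.StringIO()  # swap for open("centers.csv", "w", newline="") to save to disk
write_csv(geocoded, buffer)
```

Rows that fail to geocode keep empty lat/lon cells instead of being dropped, so they can be found and fixed by hand later.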

When the City of New York launched its OpenData portal in 2011, many new civic applications were born, but to date the DoHMH day care dataset has not been added, and attempts to have it released have so far been unsuccessful. Ultimately, one of the goals of the Day Care Map project is to update the application's map interface regularly and in real time. One can dream. Data scraping is a big weapon in the arsenal of civic data hackers and will always be a necessary choice given the state of some public data (can you say PDF?). In an ideal world, OpenData should strive to move beyond data scraping and in the process provide parents, and data scientists, more of that elusive quality time with their young children. This is the kind of qualitative data that is a net positive: it eases the stress of locating adequate child care in the Naked City.

About the project:

Day Care Map is a web app that promises to give parents the latest information via an easy-to-use and mobile-friendly interface, a one-stop map of all licensed child care options, center-based group child care, private and family day care within New York City.

To contact Anita or for more info, visit BigApps.
