Table of Contents

  1. Which tree species should the city plant?
    1.1 Background
    1.2 The Data

  2. Data Cleansing
    2.1 Explore Column Types and Entries
    2.2 Convert Data Types and Transform Entries
    2.3 Cleaning null values
    2.4 Check for duplicates

  3. Analysis
    3.1 What are the most common tree species in Manhattan?
    3.2 Which neighborhoods have the most trees?
    3.3 A visualization of Manhattan's neighborhoods and trees
    3.4 Ten species recommendation

Which tree species should the city plant?

📖 Background

You work for a nonprofit organization advising the planning department on ways to improve the quantity and quality of trees in New York City. The urban design team believes tree size (using trunk diameter as a proxy for size) and health are the most desirable characteristics of city trees.

The city would like to learn more about which tree species are the best choice to plant on the streets of Manhattan.

💾 The data

The team has provided access to the 2015 tree census and geographical information on New York City neighborhoods (trees, neighborhoods):

Tree Census

  • "tree_id" - Unique id of each tree.
  • "tree_dbh" - The diameter of the tree in inches measured at 54 inches above the ground.
  • "curb_loc" - Location of the tree bed in relation to the curb. Either along the curb (OnCurb) or offset from the curb (OffsetFromCurb).
  • "spc_common" - Common name for the species.
  • "status" - Indicates whether the tree is alive or standing dead.
  • "health" - Indication of the tree's health (Good, Fair, and Poor).
  • "root_stone" - Indicates the presence of a root problem caused by paving stones in the tree bed.
  • "root_grate" - Indicates the presence of a root problem caused by metal grates in the tree bed.
  • "root_other" - Indicates the presence of other root problems.
  • "trunk_wire" - Indicates the presence of a trunk problem caused by wires or rope wrapped around the trunk.
  • "trnk_light" - Indicates the presence of a trunk problem caused by lighting installed on the tree.
  • "trnk_other" - Indicates the presence of other trunk problems.
  • "brch_light" - Indicates the presence of a branch problem caused by lights or wires in the branches.
  • "brch_shoe" - Indicates the presence of a branch problem caused by shoes in the branches.
  • "brch_other" - Indicates the presence of other branch problems.
  • "postcode" - Five-digit zip code where the tree is located.
  • "nta" - Neighborhood Tabulation Area (NTA) code from the 2010 US Census for the tree.
  • "nta_name" - Neighborhood name.
  • "latitude" - Latitude of the tree, in decimal degrees.
  • "longitude" - Longitude of the tree, in decimal degrees.

Neighborhoods' geographical information

  • "ntacode" - NTA code (matches Tree Census information).
  • "ntaname" - Neighborhood name (matches Tree Census information).
  • "geometry" - Polygon that defines the neighborhood.

Tree census and neighborhood information from the City of New York NYC Open Data.

In [1]:
#remove warnings
import warnings
warnings.filterwarnings('ignore')
#import modules
import pandas as pd
import geopandas as gpd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import folium

#import trees data
trees = pd.read_csv('data/trees.csv')
trees = trees.rename(columns = {'trunk_wire' : 'trnk_wire'})
trees.head()
Out[1]:
tree_id tree_dbh curb_loc spc_common status health root_stone root_grate root_other trnk_wire trnk_light trnk_other brch_light brch_shoe brch_other postcode nta nta_name latitude longitude
0 190422 11 OnCurb honeylocust Alive Good No No No No No No No No No 10023 MN14 Lincoln Square 40.770046 -73.984950
1 190426 11 OnCurb honeylocust Alive Good No No No No No No No No No 10023 MN14 Lincoln Square 40.770210 -73.985338
2 208649 9 OnCurb American linden Alive Good No Yes No No No No No No No 10019 MN15 Clinton 40.762724 -73.987297
3 193310 14 OnCurb honeylocust Alive Good No No Yes No No Yes No No Yes 10021 MN40 Upper East Side-Carnegie Hill 40.772171 -73.960456
4 199760 4 OnCurb willow oak Alive Fair No No No No No No No No No 10024 MN12 Upper West Side 40.782087 -73.980964
In [2]:
def to_name(entry):
    #collapse repeated whitespace in a string and trim both ends
    return ' '.join(entry.split())
In [4]:
#import neighborhoods geodata
neighborhoods = gpd.read_file('data/nta.shp')
neighborhoods.head()
Out[4]:
borocode boroname countyfips ntacode ntaname shape_area shape_leng geometry
0 3.0 Brooklyn 047 BK43 Midwood 3.579964e+07 27996.591274 POLYGON ((-73.94733 40.62917, -73.94687 40.626...
1 3.0 Brooklyn 047 BK75 Bedford 3.262983e+07 29992.919174 POLYGON ((-73.94193 40.70073, -73.94439 40.700...
2 2.0 Bronx 005 BX40 Fordham South 6.307284e+06 15878.272921 POLYGON ((-73.89138 40.86170, -73.89106 40.861...
3 3.0 Brooklyn 047 BK88 Borough Park 5.400502e+07 39247.227722 POLYGON ((-73.97605 40.63128, -73.97717 40.630...
4 3.0 Brooklyn 047 BK96 Rugby-Remsen Village 3.270695e+07 30957.853395 POLYGON ((-73.90856 40.65210, -73.90945 40.651...

Data Cleansing

Explore column types and entries

In [5]:
#check trees dtypes and non-null count
for df in [trees, neighborhoods]:
	display(df.info(), df.isna().sum())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 64229 entries, 0 to 64228
Data columns (total 20 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   tree_id     64229 non-null  int64  
 1   tree_dbh    64229 non-null  int64  
 2   curb_loc    64229 non-null  object 
 3   spc_common  62428 non-null  object 
 4   status      64229 non-null  object 
 5   health      62427 non-null  object 
 6   root_stone  64229 non-null  object 
 7   root_grate  64229 non-null  object 
 8   root_other  64229 non-null  object 
 9   trnk_wire   64229 non-null  object 
 10  trnk_light  64229 non-null  object 
 11  trnk_other  64229 non-null  object 
 12  brch_light  64229 non-null  object 
 13  brch_shoe   64229 non-null  object 
 14  brch_other  64229 non-null  object 
 15  postcode    64229 non-null  int64  
 16  nta         64229 non-null  object 
 17  nta_name    64229 non-null  object 
 18  latitude    64229 non-null  float64
 19  longitude   64229 non-null  float64
dtypes: float64(2), int64(3), object(15)
memory usage: 9.8+ MB
None
tree_id          0
tree_dbh         0
curb_loc         0
spc_common    1801
status           0
health        1802
root_stone       0
root_grate       0
root_other       0
trnk_wire        0
trnk_light       0
trnk_other       0
brch_light       0
brch_shoe        0
brch_other       0
postcode         0
nta              0
nta_name         0
latitude         0
longitude        0
dtype: int64
<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 195 entries, 0 to 194
Data columns (total 8 columns):
 #   Column      Non-Null Count  Dtype   
---  ------      --------------  -----   
 0   borocode    195 non-null    float64 
 1   boroname    195 non-null    object  
 2   countyfips  195 non-null    object  
 3   ntacode     195 non-null    object  
 4   ntaname     195 non-null    object  
 5   shape_area  195 non-null    float64 
 6   shape_leng  195 non-null    float64 
 7   geometry    195 non-null    geometry
dtypes: float64(3), geometry(1), object(4)
memory usage: 12.3+ KB
None
borocode      0
boroname      0
countyfips    0
ntacode       0
ntaname       0
shape_area    0
shape_leng    0
geometry      0
dtype: int64

Observations:

  • There are 64229 entries in the trees dataframe.
  • The columns from root_stone to brch_other are currently listed as object, though they could be interpreted as boolean data.
  • There are null values in the spc_common and health columns of the trees data.
In [6]:
#initialize lists of object-type and numeric column names
col_obj = list(trees.columns[trees.dtypes == 'object'])
col_num = list(trees.columns[(trees.dtypes != 'object') & (trees.dtypes != 'bool')])
In [7]:
#loop through object columns to explore entries
for col in col_obj:
    display(trees[col].value_counts(normalize = True))
OnCurb            0.933099
OffsetFromCurb    0.066901
Name: curb_loc, dtype: float64
honeylocust         0.211059
Callery pear        0.116887
ginkgo              0.093852
pin oak             0.073429
Sophora             0.071330
                      ...   
spruce              0.000016
Osage-orange        0.000016
red pine            0.000016
Persian ironwood    0.000016
smoketree           0.000016
Name: spc_common, Length: 128, dtype: float64
Alive    0.971944
Dead     0.028056
Name: status, dtype: float64
Good    0.758614
Fair    0.183574
Poor    0.057812
Name: health, dtype: float64
No     0.804201
Yes    0.195799
Name: root_stone, dtype: float64
No     0.961357
Yes    0.038643
Name: root_grate, dtype: float64
No     0.921889
Yes    0.078111
Name: root_other, dtype: float64
No     0.985723
Yes    0.014277
Name: trnk_wire, dtype: float64
No     0.994847
Yes    0.005153
Name: trnk_light, dtype: float64
No     0.913123
Yes    0.086877
Name: trnk_other, dtype: float64
No     0.986377
Yes    0.013623
Name: brch_light, dtype: float64
No     0.99905
Yes    0.00095
Name: brch_shoe, dtype: float64
No     0.897803
Yes    0.102197
Name: brch_other, dtype: float64
MN12    0.090411
MN40    0.071868
MN23    0.059179
MN03    0.054010
MN13    0.045634
MN36    0.045525
MN09    0.042099
MN11    0.041150
MN35    0.040667
MN34    0.039001
MN04    0.036790
MN31    0.035451
MN19    0.034657
MN32    0.033941
MN24    0.033785
MN14    0.031824
MN15    0.030422
MN33    0.030282
MN28    0.029831
MN20    0.026530
MN22    0.024008
MN01    0.022980
MN27    0.022684
MN25    0.020147
MN17    0.018434
MN21    0.017780
MN06    0.014044
MN50    0.006866
Name: nta, dtype: float64
Upper West Side                               0.090411
Upper East Side-Carnegie Hill                 0.071868
West Village                                  0.059179
Central Harlem North-Polo Grounds             0.054010
Hudson Yards-Chelsea-Flatiron-Union Square    0.045634
Washington Heights South                      0.045525
Morningside Heights                           0.042099
Central Harlem South                          0.041150
Washington Heights North                      0.040667
East Harlem North                             0.039001
Hamilton Heights                              0.036790
Lenox Hill-Roosevelt Island                   0.035451
Turtle Bay-East Midtown                       0.034657
Yorkville                                     0.033941
SoHo-TriBeCa-Civic Center-Little Italy        0.033785
Lincoln Square                                0.031824
Clinton                                       0.030422
East Harlem South                             0.030282
Lower East Side                               0.029831
Murray Hill-Kips Bay                          0.026530
East Village                                  0.024008
Marble Hill-Inwood                            0.022980
Chinatown                                     0.022684
Battery Park City-Lower Manhattan             0.020147
Midtown-Midtown South                         0.018434
Gramercy                                      0.017780
Manhattanville                                0.014044
Stuyvesant Town-Cooper Village                0.006866
Name: nta_name, dtype: float64

Observations

  • The columns we need to turn to bool type contain only Yes and No entries and have no missing values.

Convert data types and transform entries

In [8]:
#list columns to turn to bool
to_bool = ['root_stone', 'root_grate', 'root_other', 'trnk_wire', 'trnk_light', 'trnk_other', 'brch_light', 'brch_shoe', 'brch_other']

#a 'Yes' entry becomes True, anything else becomes False
for col in to_bool:
    trees[col] = trees[col] == 'Yes'
#convert columns to boolean, note that only trees with good health will be considered healthy
trees['alive'] = trees.status == 'Alive'
trees['healthy'] = trees.health == 'Good'
trees['on_curb'] = trees.curb_loc == 'OnCurb'

trees.sample(20)
Out[8]:
tree_id tree_dbh curb_loc spc_common status health root_stone root_grate root_other trnk_wire ... brch_shoe brch_other postcode nta nta_name latitude longitude alive healthy on_curb
2163 196865 3 OnCurb cherry Alive Good False False False False ... False False 10002 MN27 Chinatown 40.717069 -73.991512 True True True
63624 189960 4 OnCurb Sophora Alive Good False False False False ... False False 10065 MN31 Lenox Hill-Roosevelt Island 40.763563 -73.963746 True True True
40283 70926 11 OnCurb honeylocust Alive Good True False False False ... False False 10013 MN24 SoHo-TriBeCa-Civic Center-Little Italy 40.725996 -74.008768 True True True
31314 533678 10 OnCurb honeylocust Alive Good False False False False ... False False 10019 MN15 Clinton 40.764134 -73.990354 True True True
63924 203888 7 OnCurb Japanese zelkova Alive Fair True False False False ... False False 10025 MN09 Morningside Heights 40.805910 -73.961159 True False True
43542 49341 19 OnCurb honeylocust Alive Good True False False False ... False False 10011 MN23 West Village 40.734688 -73.999618 True True True
10845 162819 3 OnCurb serviceberry Alive Good False False False False ... False False 10009 MN28 Lower East Side 40.725483 -73.978590 True True True
38716 65913 17 OnCurb Chinese elm Alive Good False False False False ... False False 10031 MN04 Hamilton Heights 40.821959 -73.948276 True True True
14186 255690 7 OnCurb Japanese zelkova Alive Good False False False False ... False False 10040 MN35 Washington Heights North 40.855276 -73.930711 True True True
3088 171661 16 OnCurb Sophora Alive Good False False False False ... False False 10009 MN28 Lower East Side 40.727591 -73.976396 True True True
21687 296601 8 OnCurb crimson king maple Alive Fair False False False False ... False True 10021 MN40 Upper East Side-Carnegie Hill 40.768693 -73.964001 True False True
50049 107416 6 OnCurb Japanese zelkova Alive Good False False False False ... False False 10034 MN01 Marble Hill-Inwood 40.865816 -73.922447 True True True
12330 278478 5 OnCurb ginkgo Alive Good True False False False ... False False 10128 MN32 Yorkville 40.778958 -73.948196 True True True
23989 363310 4 OnCurb Chinese elm Alive Good False False False False ... False False 10032 MN36 Washington Heights South 40.844657 -73.940615 True True True
37033 43604 16 OnCurb honeylocust Alive Good True False True False ... False False 10027 MN34 East Harlem North 40.804893 -73.941464 True True True
33906 27732 9 OnCurb ginkgo Alive Good False True False False ... False False 10011 MN13 Hudson Yards-Chelsea-Flatiron-Union Square 40.741478 -73.997205 True True True
20760 310333 10 OnCurb ginkgo Alive Fair False False False False ... False False 10036 MN15 Clinton 40.760222 -73.989072 True False True
9361 241442 11 OnCurb London planetree Alive Fair False False False False ... False False 10016 MN20 Murray Hill-Kips Bay 40.743966 -73.977797 True False True
31611 537302 18 OnCurb willow oak Alive Good True False True False ... False False 10024 MN12 Upper West Side 40.780991 -73.979152 True True True
60550 181097 5 OnCurb Japanese zelkova Alive Fair False False False False ... False True 10040 MN35 Washington Heights North 40.864522 -73.927832 True False True

20 rows × 23 columns

In [9]:
trees.dtypes
Out[9]:
tree_id         int64
tree_dbh        int64
curb_loc       object
spc_common     object
status         object
health         object
root_stone       bool
root_grate       bool
root_other       bool
trnk_wire        bool
trnk_light       bool
trnk_other       bool
brch_light       bool
brch_shoe        bool
brch_other       bool
postcode        int64
nta            object
nta_name       object
latitude      float64
longitude     float64
alive            bool
healthy          bool
on_curb          bool
dtype: object

The columns now have the data types we need, so we can proceed to cleaning up null values.

Cleaning null values

In [10]:
#import missingno package
import missingno as msno

#create figure
fig, (ax1, ax2) = plt.subplots(2, 1, figsize = (10,8), tight_layout = True)
fig.suptitle('Null Values Details')

#create ax1 to find out if missing values are determined by tree_dbh
ax1 = msno.matrix(trees.sort_values('tree_dbh'), ax = ax1)

#create ax2 to find out if missing values are correlated
ax2 = msno.heatmap(trees, ax = ax2)

#show figure
plt.show()
In [11]:
#print out percentage of missing values
print('Percentage of missing values: ', round(100*(max(trees.isna().sum()) / trees.shape[0]),2), '%')
Percentage of missing values:  2.81 %

The 1 on the correlation heatmap indicates that spc_common and health are almost always missing together, which matches the null counts above (1801 and 1802). This means that about 1802 rows contain missing values. The matrix shows that all the missing values have trunk diameters that are less than the median. It is also worth noting that the percentage of missing values is similar to the percentage of dead trees. Let us check.
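
One way to confirm what the heatmap suggests is to cross-tabulate the two null masks directly. The sketch below uses a small made-up frame (the rows are invented for illustration and are not taken from the census) that reuses the spc_common and health column names:

```python
import pandas as pd

# Toy stand-in for the trees table; these rows are invented for illustration.
toy = pd.DataFrame({
    'spc_common': ['honeylocust', None, 'ginkgo', None, 'pin oak'],
    'health': ['Good', None, 'Fair', None, None],
})

# Rows: is spc_common missing? Columns: is health missing?
null_overlap = pd.crosstab(toy['spc_common'].isna(), toy['health'].isna())
print(null_overlap)

# Count rows where exactly one of the two columns is missing.
only_one_missing = (toy['spc_common'].isna() ^ toy['health'].isna()).sum()
print('Rows with exactly one of the two missing:', only_one_missing)
```

On the real table, the null counts reported earlier (1801 for spc_common versus 1802 for health) imply that at least one row has a missing health value but a recorded species, so the same check should report a small non-zero count there.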

In [12]:
#check for dead trees
trees.query('status != "Alive"').isna().sum()
Out[12]:
tree_id          0
tree_dbh         0
curb_loc         0
spc_common    1801
status           0
health        1802
root_stone       0
root_grate       0
root_other       0
trnk_wire        0
trnk_light       0
trnk_other       0
brch_light       0
brch_shoe        0
brch_other       0
postcode         0
nta              0
nta_name         0
latitude         0
longitude        0
alive            0
healthy          0
on_curb          0
dtype: int64

We can see that the null values came from dead trees. Let's make necessary changes to reflect the dead trees.

In [13]:
#replace null values on health with dead and turn it into ordinal data
trees['health'] = trees.health.fillna('Dead')
#initialize mapping
health_map = {'Dead' : 0, 'Poor' : 1, 'Fair' : 2, 'Good' : 3}
#map into ordinal data (assign the result back instead of using inplace on a column selection)
trees['health'] = trees['health'].map(health_map)

#replace null values on species with unidentified
trees['spc_common'] = trees.spc_common.fillna('Unidentified')

Check for duplicates

In [14]:
#capitalize each word in the spc_common values so that species names are written consistently
species_capitalize = []
for entry in list(trees.spc_common.str.split()):
    new_entry = []
    for word in entry:
        new_entry.append(word.capitalize())
    to_enter = ' '.join(new_entry)
    to_enter = to_enter.strip()
    species_capitalize.append(to_enter)
trees['spc_common'] = species_capitalize

#check dataframe
trees.head()
Out[14]:
tree_id tree_dbh curb_loc spc_common status health root_stone root_grate root_other trnk_wire ... brch_shoe brch_other postcode nta nta_name latitude longitude alive healthy on_curb
0 190422 11 OnCurb Honeylocust Alive 3 False False False False ... False False 10023 MN14 Lincoln Square 40.770046 -73.984950 True True True
1 190426 11 OnCurb Honeylocust Alive 3 False False False False ... False False 10023 MN14 Lincoln Square 40.770210 -73.985338 True True True
2 208649 9 OnCurb American Linden Alive 3 False True False False ... False False 10019 MN15 Clinton 40.762724 -73.987297 True True True
3 193310 14 OnCurb Honeylocust Alive 3 False False True False ... False True 10021 MN40 Upper East Side-Carnegie Hill 40.772171 -73.960456 True True True
4 199760 4 OnCurb Willow Oak Alive 2 False False False False ... False False 10024 MN12 Upper West Side 40.782087 -73.980964 True False True

5 rows × 23 columns

In [15]:
#check for duplicated rows
trees.duplicated().sum()
Out[15]:
0

There are no duplicated rows. We are now confident that we have thoroughly cleaned our data. We are free to begin analysis.
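
A stricter variant of this check looks only at the key column, since two rows can describe the same tree and still differ in another field. The following is a minimal sketch on invented rows, assuming tree_id is meant to be unique as the data description states:

```python
import pandas as pd

# Toy frame, invented for illustration: two rows share a tree_id but differ elsewhere.
toy = pd.DataFrame({
    'tree_id': [101, 102, 102, 103],
    'tree_dbh': [11, 9, 10, 4],
})

# Compares every column, so the two rows for tree 102 do not count as duplicates.
full_row_dupes = toy.duplicated().sum()

# Compares the key only, so the second row for tree 102 is flagged.
id_dupes = toy.duplicated(subset='tree_id').sum()

print(full_row_dupes, id_dupes)
```

Running the key-only version on the trees table would show whether any tree_id is repeated, which the full-row check above cannot rule out on its own.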

Analysis

What are the most common tree species in Manhattan?

We assumed from the nta codes that all the trees in the dataset are located in Manhattan. Let us verify that.

In [16]:
#merge trees data with neighborhoods geodata
trees_neighborhoods = trees.merge(neighborhoods[['boroname', 'ntacode']], left_on = 'nta', right_on = 'ntacode', how = 'left')

#print out unique entries on boroname column
trees_neighborhoods.boroname.unique()
Out[16]:
array(['Manhattan'], dtype=object)
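
Because this is a left merge, a tree whose code has no match in the neighborhoods table would be kept with a missing boroname rather than dropped. As a hedged sketch of how such rows could be surfaced explicitly, the example below uses two invented frames (MN99 is a made-up code) together with the `indicator` and `validate` options of `merge`:

```python
import pandas as pd

# Invented frames for illustration; 'MN99' is a made-up code with no match.
toy_trees = pd.DataFrame({'tree_id': [1, 2, 3], 'nta': ['MN14', 'MN15', 'MN99']})
toy_nbhd = pd.DataFrame({'ntacode': ['MN14', 'MN15'],
                         'boroname': ['Manhattan', 'Manhattan']})

# validate raises if a code appears more than once on the right side;
# indicator adds a _merge column describing where each row came from.
merged = toy_trees.merge(toy_nbhd, left_on='nta', right_on='ntacode',
                         how='left', validate='many_to_one', indicator=True)

# Rows present only in the trees table have no matching neighborhood.
unmatched = merged[merged['_merge'] == 'left_only']
print(unmatched[['tree_id', 'nta']])
```

An empty `unmatched` frame on the real data would mean every tree was matched to a neighborhood, which is consistent with the single 'Manhattan' value shown above.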

Now that we are sure, let us proceed to finding the most common species.

In [17]:
#group trees by species, aggregate by count
trees_grouped_by_species = trees.query('status == "Alive"').groupby('spc_common').agg({'spc_common': 'count'})\
														.rename(columns = {'spc_common' : 'count'})\
														.reset_index()\
														.sort_values('count', ascending = False)
In [18]:
#create visual style
wood_color = '#340006'
sns.set_style('darkgrid', {'grid.color' : wood_color,
                           'axes.facecolor' : 'gray',
                           'figure.facecolor' : 'lightgray'})
In [19]:
#initialize figure
fig = plt.figure (figsize = (7,4))
#set figure title
fig.suptitle('Top 20 most common tree species in Manhattan', color = wood_color)

#create barplot
ax = sns.barplot(data = trees_grouped_by_species.nlargest(20, 'count'), y = 'spc_common', x = 'count', color = 'forestgreen')
ax.set_ylabel('')
ax.set_xlabel('Count')

#show plot
plt.show()

Observations

  • Honeylocust is the most prevalent species, amounting to around 21% of the observations. This means that roughly 1 of every 5 street trees recorded in Manhattan is a Honeylocust.
  • While Callery Pear has only about half as many trees as Honeylocust, it is still a sizeable amount. We can say that the top 8 in this list make up a sizeable portion of the trees in Manhattan.
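
To put numbers on these shares, the sketch below accumulates the fractions of the five most common species. The values are copied from the value_counts output earlier in the notebook, which covers trees with an identified species, so the figures for living trees only may differ slightly:

```python
# Shares of the five most common species, copied from the value_counts output
# earlier in the notebook (fractions of trees with an identified species).
top_shares = {
    'honeylocust': 0.211059,
    'Callery pear': 0.116887,
    'ginkgo': 0.093852,
    'pin oak': 0.073429,
    'Sophora': 0.071330,
}

running = 0.0
cumulative = {}
for species, share in top_shares.items():
    running += share
    cumulative[species] = round(running, 6)
    print(f'{species:<14} share={share:.3f}  cumulative={running:.3f}')

# Express the honeylocust share as "1 in N".
one_in_n = 1 / top_shares['honeylocust']
print(f'About 1 in {one_in_n:.1f} identified trees is a honeylocust')
```

The five species together account for a little over half of the identified trees.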

Which neighborhoods have the most trees?

Create aggregated table with details about each Manhattan neighborhood

In [20]:
#create table grouped by neighborhood
trees_grouped_by_neighborhood = trees.groupby('nta').agg({'nta_name': 'count',
                                                               'tree_dbh' : 'mean',
                                                               'health' : 'mean',
                                                               'alive' : 'mean',
                                                              'on_curb' : 'mean',
                                                               'root_stone' :'mean',
                                                               'root_grate' : 'mean',
                                                               'root_other' : 'mean',
                                                               'trnk_wire' : 'mean',
                                                               'trnk_light' : 'mean',
                                                               'trnk_other' : 'mean',
                                                               'brch_light' : 'mean',
                                                               'brch_shoe' : 'mean',
                                                               'brch_other' : 'mean',
                                                              })\
                                                        .reset_index().rename(columns = {'nta_name' : 'trees_count', 'health' : 'average_health'})

#create table with neighborhood names and nta codes
nta_and_name = pd.DataFrame(trees.groupby('nta')['nta_name'].apply(lambda name: max(name)))

#join table with neighborhood names
trees_grouped_by_neighborhood = nta_and_name.merge(trees_grouped_by_neighborhood, on = 'nta')
In [21]:
#subset neighborhoods table with only Manhattan nbhd entries (copy so new columns are not added to a view)
Manhattan_neighborhoods = neighborhoods[neighborhoods.ntacode.str.contains('MN')].copy()
#create nbhd centers
Manhattan_neighborhoods['center_lng'] = Manhattan_neighborhoods.geometry.centroid.x
Manhattan_neighborhoods['center_lat'] = Manhattan_neighborhoods.geometry.centroid.y

#join with trees aggregated table, we now have a table with all information about each neighborhood and its aggregated tree data
Manhattan_neighborhoods_and_trees = Manhattan_neighborhoods.merge(trees_grouped_by_neighborhood,
                                                                  left_on = 'ntacode', right_on = 'nta', how = 'right')\
                                                            .drop(['nta', 'nta_name'], axis = 1)
In [22]:
#conversion from square meters to square kilometers
sqm_to_sqkm = 10**6

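#create a copy of our main table, reproject to EPSG:3857 (Web Mercator) to have meter units
#note: Web Mercator inflates areas at Manhattan's latitude, so treat areas and densities as relative values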
dense_find = Manhattan_neighborhoods_and_trees.copy()
dense_find['geometry'] = Manhattan_neighborhoods_and_trees.geometry.to_crs(epsg = 3857)
#transform shape area to have square kilometer units
dense_find['shape_area'] = dense_find.geometry.area / sqm_to_sqkm
#find density
dense_find['tree_density'] = dense_find['trees_count'] / dense_find['shape_area']

#add tree_density column to dataframe
Manhattan_neighborhoods_and_trees['tree_density'] = dense_find['tree_density']
#change area column values to square kilometer units
Manhattan_neighborhoods_and_trees['area'] = dense_find['shape_area']
#drop shape_area column
Manhattan_neighborhoods_and_trees.drop('shape_area', axis = 1, inplace = True)

#delete frame used to find density and area values
del dense_find
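
A caveat on the areas computed above: EPSG:3857 is Web Mercator, which stretches lengths by 1/cos(latitude), so areas come out larger than they are on the ground. The following is a rough sketch of the size of that effect, assuming a single representative latitude for Manhattan (the value 40.78 is an assumption chosen to match the sample coordinates shown earlier):

```python
import math

# Approximate latitude of Manhattan in degrees north; an assumption for illustration.
lat_deg = 40.78

# In Web Mercator, lengths are stretched by 1/cos(latitude),
# so areas are stretched by the square of that factor.
length_scale = 1 / math.cos(math.radians(lat_deg))
area_scale = length_scale ** 2

print(f'length scale ~ {length_scale:.3f}, area scale ~ {area_scale:.3f}')
```

Since the factor is nearly constant across Manhattan, the neighborhood rankings should be largely unaffected, but the absolute km² and trees-per-km² values are not ground-true. Reprojecting to a local projected CRS in meters (for example UTM zone 18N, EPSG:32618) is one possible alternative if absolute figures matter.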

Plot neighborhoods tree counts

In [23]:
#initialize figure with one row and two columns
fig, ax = plt.subplots (1,2, figsize = (10,3), tight_layout = True)
fig.suptitle('Top Ten Neighborhoods With Most Trees', color = 'forestgreen')

#create barplot with neighborhoods with most trees
ax1 = sns.barplot(data = Manhattan_neighborhoods_and_trees.nlargest(10, 'trees_count'),
                  y = 'ntaname', x = 'trees_count',
                  color = 'forestgreen', ax = ax[0])
ax1.set_title('Top 10 Most trees', color = wood_color)
ax1.set_ylabel('')
ax1.set_xlabel('Number of Trees')

#create barplot with most dense neighborhoods
ax2 = sns.barplot(data = Manhattan_neighborhoods_and_trees.nlargest(10, 'tree_density'),
                  y = 'ntaname', x = 'tree_density',
                  color = 'forestgreen', ax = ax[1])
ax2.set_title('Top 10 Most trees per square kilometer', color = wood_color)
ax2.set_ylabel('')
ax2.set_xlabel('Trees per km\u00B2')

plt.show()
In [24]:
#check for intersection between the two charts above

#change the lists to sets
top_ten_most_dense_nbhd = set(Manhattan_neighborhoods_and_trees.nlargest(10, 'tree_density').ntaname)
top_ten_most_trees_nbhd = set(Manhattan_neighborhoods_and_trees.nlargest(10, 'trees_count').ntaname)

#find intersection
intersection = list(top_ten_most_dense_nbhd.intersection(top_ten_most_trees_nbhd))

#print intersection
print("The neighborhoods that appear on both lists are: " , ', '.join(intersection))
The neighborhoods that appear on both lists are:  Central Harlem North-Polo Grounds, Central Harlem South, West Village, Upper West Side, Upper East Side-Carnegie Hill, Morningside Heights

We will further investigate the neighborhoods of Manhattan and their trees. For now, we can say that most of the neighborhoods with the most trees (6 of the top 10) are also among the most densely planted.

A visualization of Manhattan's neighborhoods and trees

In [25]:
#import pearson r score
from scipy.stats import pearsonr

#create new column to show if neighborhood is ranked with most number of trees or most dense with trees
#initialize empty list
nbhd_classification = []
#loop through neighborhoods
for nbhd in Manhattan_neighborhoods_and_trees['ntaname']:
    if nbhd in intersection:
        nbhd_classification.append('Top 10 Both')
    elif nbhd in top_ten_most_trees_nbhd:
        nbhd_classification.append('Top 10 Most Trees')
    elif nbhd in top_ten_most_dense_nbhd:
        nbhd_classification.append('Top 10 Most Dense')
    else:
        nbhd_classification.append('Not Ranked')

#insert new column
Manhattan_neighborhoods_and_trees['nbhd_tree_count_classification'] = nbhd_classification

#find Pearson r correlation score
r_score = pearsonr(Manhattan_neighborhoods_and_trees.area, Manhattan_neighborhoods_and_trees.trees_count)

#initialize figure
fig, ax = plt.subplots(1,1, figsize = (10,5))
fig.suptitle('Trees in Manhattan per square kilometer', color = 'forestgreen')

#create scatterplot
ax = sns.scatterplot(data = Manhattan_neighborhoods_and_trees,x = 'area', y = 'trees_count',
                hue = 'nbhd_tree_count_classification', palette = 'Set1',
                legend = True)

#create regplot, hide scatter points to show plot above, set ci = None to show only best fit line
ax = sns.regplot(data = Manhattan_neighborhoods_and_trees,x = 'area', y = 'trees_count', 
                 scatter_kws = {'alpha' : 0}, line_kws = {'color' : 'yellow', 'alpha' : 0.5}, ci = None)

#set title and labels
ax.set_title('Hued by Ranking', color = wood_color)
ax.set_xlabel('Area (km\u00B2)', color = wood_color)
ax.set_ylabel('Number of Trees', color = wood_color)
ax.text(5.05, 4000, f'Pearson r = {r_score[0].round(2)}', color = 'white')

#position legend outside plot
plt.legend(bbox_to_anchor = (1, 1.03), loc = 'upper left')
plt.show()

Pearson r score

  • It is a number that indicates the strength of the linear correlation between two variables.
  • It ranges from -1 to +1.
  • A number close to -1 indicates a strong negative correlation, where one variable increases while the other decreases.
  • A number close to 0 indicates a weak correlation, and 0 indicates no linear correlation.
  • A number close to +1 indicates a strong positive correlation, where both variables increase or decrease together.
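
The definition above can be sketched without any library: Pearson r is the covariance of the two variables divided by the product of their standard deviations. The helper name `pearson_r` and the sample values below are hypothetical and only illustrate the two extremes; the notebook itself uses `scipy.stats.pearsonr`.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation of two equal-length sequences (hypothetical helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (the covariance, up to a constant factor)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (the variances, up to the same factor)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Invented values for illustration
areas = [1.0, 2.0, 3.0, 4.0]
r_up = pearson_r(areas, [2.0, 4.0, 6.0, 8.0])     # y rises in step with x
r_down = pearson_r(areas, [8.0, 6.0, 4.0, 2.0])   # y falls in step with x
print(r_up, r_down)
```

The first pair lands at the positive end of the range and the second at the negative end, matching the bullets above.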

Observations

  • There is a moderate positive correlation between a neighborhood's area and the number of trees in it. The neighborhoods ranked as most dense tend to sit above the best-fit line, which is not a surprise.
  • Most neighborhoods have under 3000 trees, including some large neighborhoods. We will look at whether the amount of harm done to the trees lowers those numbers.
In [26]:
#initialize figure with 1 row and 2 columns
fig, ax = plt.subplots(1,2, figsize = (10,5), sharey = True)
fig.suptitle('Trees in Manhattan per Square Kilometer', color = 'forestgreen')

#create 1st plot
ax1 = sns.scatterplot(data = Manhattan_neighborhoods_and_trees,x = 'area', y = 'trees_count',
                hue = 'nbhd_tree_count_classification', palette = 'Set1',
                legend = True, ax = ax[0])
ax1.set_title('Hued by Ranking', color = wood_color)
ax1.set_xlabel('Area (km\u00B2)')
#position legend to the left of the first plot
ax1.legend(bbox_to_anchor = (-0.1, 1.03), loc = 'upper right')

#create temporary dataframe with average tree problems for each neighborhood
harm_rating = Manhattan_neighborhoods_and_trees[to_bool]
#sum the averages across problem types, we now have the average number of problems per tree for each neighborhood
Manhattan_neighborhoods_and_trees['overall_problem'] = harm_rating.sum(axis = 1)

#initialize light green palette
green_pal = sns.light_palette('forestgreen', as_cmap = True)

#create second plot
ax2 = sns.scatterplot(data = Manhattan_neighborhoods_and_trees, x = 'area', y = 'trees_count',
                hue = 'average_health', palette = green_pal,
                size = 'overall_problem', ax = ax[1])
ax2.set_title('Hued by Health, Sized by Overall Problems')
ax2.set_xlabel('Area (km\u00B2)')
ax2.legend(bbox_to_anchor = (1,1.03))

#show plots
plt.show()

Observation

  • The neighborhoods that were not ranked tend to have more overall problems with their trees.
  • This could mean that neighborhoods that do not take good care of their trees tend to have fewer trees.
  • The fainter dots tend to be the bigger ones, which suggests that how well trees are cared for affects their health.
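
The overall problem score used for the dot sizes is a row-wise sum over the problem columns. A minimal sketch of that idea on toy data follows; `problem_cols` is a hypothetical stand-in for the notebook's `to_bool` list, and the values are invented.

```python
import pandas as pd

# toy data: invented values; column names are a subset of the census problem columns
toy_trees = pd.DataFrame({
    'root_stone': [True, False, True],
    'root_grate': [False, False, True],
    'trunk_wire': [True, False, True],
})

# hypothetical stand-in for the notebook's to_bool list of problem columns
problem_cols = ['root_stone', 'root_grate', 'trunk_wire']

# True counts as 1 and False as 0, so a row-wise sum counts the problems per tree
toy_trees['overall_problem'] = toy_trees[problem_cols].sum(axis=1)
print(toy_trees['overall_problem'].tolist())
```

When the same sum is applied to per-neighborhood averages of these columns, the result reads as the average number of problems per tree in that neighborhood.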
In [27]:
#import folium
import folium

#find center of Manhattan, we need this to center our map
centerpoint = [np.median(trees.latitude) + .01, np.median(trees.longitude)- .01]

#initialize folium map
Manhattan = folium.Map(location = centerpoint, 
                       zoom_start = 11.5, 
                       width = 700, height = 800,
                       zoom_control=False,
                       scrollWheelZoom=False,
                       dragging=False,
                      legend = 'Manhattan',
                      tiles = 'cartodbpositron')
 
#create base map for initial view of map
base_map = folium.FeatureGroup(name = 'base', overlay = False, control = False).add_to(Manhattan)
folium.TileLayer(tiles = 'cartodbpositron').add_to(base_map)
base_map.add_to(Manhattan)

#create popups that show neighborhood name, and number of trees
for lab, row in Manhattan_neighborhoods_and_trees.iterrows():
    location = [row.center_lat, row.center_lng]
    popup = row.ntaname + ', trees: ' + str(row.trees_count)
    folium.Marker(location = location, popup = popup).add_to(base_map)


#create choropleth layer based on Trees per Square kilometer
layer1 = folium.Choropleth(geo_data = Manhattan_neighborhoods_and_trees,
                     name = 'Trees per Square Kilometer', 
                     data = Manhattan_neighborhoods_and_trees,
                     columns = ['ntaname', 'tree_density'],
                    key_on = 'feature.properties.ntaname',
                    fill_color = 'BuGn', fill_opacity = 20,
                    legend_name = 'Trees per square kilometer by neighborhood', overlay = False).add_to(Manhattan)

#create choropleth layer based on number of trees
layer2 = folium.Choropleth(geo_data = Manhattan_neighborhoods_and_trees,
                     name = 'Tree Count', 
                     data = Manhattan_neighborhoods_and_trees,
                     columns = ['ntaname', 'trees_count'],
                    key_on = 'feature.properties.ntaname',
                    fill_color = 'BuGn', fill_opacity = 20,
                    legend_name = 'Trees by neighborhood', overlay = False).add_to(Manhattan)

#create choropleth layer based on average tree diameter
layer3 =  folium.Choropleth(geo_data = Manhattan_neighborhoods_and_trees,
                     name = 'Average Tree Diameter', 
                     data = Manhattan_neighborhoods_and_trees,
                     columns = ['ntaname', 'tree_dbh'],
                    key_on = 'feature.properties.ntaname',
                    fill_color = 'BuGn', fill_opacity = 20,
                    legend_name = 'Average Tree Diameter', overlay = False).add_to(Manhattan)

#create choropleth layer based on tree health
layer4 =  folium.Choropleth(geo_data = Manhattan_neighborhoods_and_trees,
                     name = 'Average Tree Health', 
                     data = Manhattan_neighborhoods_and_trees,
                     columns = ['ntaname', 'average_health'],
                    key_on = 'feature.properties.ntaname',
                    fill_color = 'BuGn', fill_opacity = 20,
                    legend_name = 'Average Tree Health', overlay = False).add_to(Manhattan)

#create choropleth layer based on average tree problems
harm_layer =  folium.Choropleth(geo_data = Manhattan_neighborhoods_and_trees,
                     name = 'Average Problem per Tree', 
                     data = Manhattan_neighborhoods_and_trees,
                     columns = ['ntaname', 'overall_problem'],
                    key_on = 'feature.properties.ntaname',
                    fill_color = 'PuRd', fill_opacity = 20,
                    legend_name = 'Average Problem per Tree', overlay = False, legend = False).add_to(Manhattan)

#create choropleth layer based on percentage of trees planted on curb
curb_layer =  folium.Choropleth(geo_data = Manhattan_neighborhoods_and_trees,
                     name = 'Average on curb rate', 
                     data = Manhattan_neighborhoods_and_trees,
                     columns = ['ntaname', 'on_curb'],
                    key_on = 'feature.properties.ntaname',
                    fill_color = 'Greys', fill_opacity = 20,
                    legend_name = 'Average on curb rate', overlay = False, legend = False).add_to(Manhattan)

#add in layer control for view control
folium.LayerControl(collapsed = False, position = 'topright').add_to(Manhattan)

#display folium map
Manhattan

On using the map

  • If the map does not show, please open this notebook in Jupyter Notebook.
  • Clicking on each layer shows a choropleth that colors the map by the specific feature.
  • The rectangular area in the middle is Central Park.

Observations

  • The neighborhoods with the highest average tree health do not necessarily have the trees with the biggest diameter. This could be because of the species of trees in each area.
  • The shading on the layers for tree health, diameter, and abundance is similar, while it reverses when we look at the average problem per tree.
  • The vast majority of neighborhoods plant most of their trees on the curb.

Ten species recommendation¶

What to find?

  • From the analysis above, the neighborhoods that are dense with trees do less harm to their trees. If the city plants a tree, it will most likely be in a neighborhood with fewer trees, where trees face more problems. Hence, we need to find species that can withstand the harm done to them.

Which species are most common in the neighborhoods most dense with trees?¶

In [28]:
#group trees by neighborhood, then species, filter to neighborhoods in most dense ranking
trees_grouped_by_nbhd = trees[trees.nta_name.isin(top_ten_most_dense_nbhd)].groupby(['nta_name', 'spc_common'])\
                                .agg({'spc_common' : 'count'})\
                                .rename(columns = {'spc_common' : 'count'}).sort_values('count', ascending = False)

#find the top 5 most prevalent trees in each neighborhood
trees_grouped_by_nbhd_most_common_species = trees_grouped_by_nbhd.groupby(level = 'nta_name').head(5).sort_values(['nta_name', 'count'], ascending = [True, False])

#get species names, input to list
most_common_trees_in_most_densed_nbhd = list(trees_grouped_by_nbhd_most_common_species.reset_index().spc_common.unique())

#show dataframe
trees_grouped_by_nbhd_most_common_species
Out[28]:
nta_name                          spc_common        count
Central Harlem North-Polo Grounds Honeylocust         640
                                  Callery Pear        451
                                  Pin Oak             273
                                  Sophora             269
                                  London Planetree    265
Central Harlem South              Honeylocust         441
                                  Ginkgo              280
                                  London Planetree    252
                                  Callery Pear        234
                                  Japanese Zelkova    196
East Village                      Honeylocust         322
                                  Callery Pear        197
                                  Sophora             150
                                  Ginkgo              143
                                  Littleleaf Linden   101
Gramercy                          Honeylocust         231
                                  Ginkgo              185
                                  Callery Pear        166
                                  Sophora              94
                                  London Planetree     91
Hamilton Heights                  Pin Oak             426
                                  Honeylocust         385
                                  Sophora             166
                                  London Planetree    140
                                  Callery Pear        127
Morningside Heights               Honeylocust         391
                                  Ginkgo              371
                                  American Elm        296
                                  Callery Pear        232
                                  Pin Oak             225
Upper East Side-Carnegie Hill     Callery Pear        828
                                  Honeylocust         725
                                  Ginkgo              569
                                  American Elm        390
                                  Littleleaf Linden   286
Upper West Side                   Honeylocust         972
                                  Pin Oak             809
                                  Ginkgo              533
                                  Sophora             492
                                  Callery Pear        462
West Village                      Callery Pear        568
                                  Honeylocust         560
                                  Ginkgo              535
                                  Sophora             280
                                  London Planetree    246
Yorkville                         Honeylocust         680
                                  Callery Pear        286
                                  Ginkgo              188
                                  Littleleaf Linden   147
                                  Sophora             129
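
The "top species per neighborhood" step can be sketched on toy data as follows. This is an illustration only: the neighborhoods and species are invented, and it counts rows with `size()` rather than the `agg` call used in the cell above.

```python
import pandas as pd

# toy data: invented neighborhoods and species, not the census records
toy = pd.DataFrame({
    'nta_name': ['A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B'],
    'spc_common': ['Oak', 'Oak', 'Oak', 'Elm', 'Elm', 'Ginkgo', 'Elm', 'Elm', 'Oak'],
})

# count trees per (neighborhood, species) pair, largest counts first within each neighborhood
counts = (toy.groupby(['nta_name', 'spc_common'])
             .size()
             .rename('count')
             .reset_index()
             .sort_values(['nta_name', 'count'], ascending=[True, False]))

# keep the 2 most common species in each neighborhood
top_per_nbhd = counts.groupby('nta_name').head(2)
print(top_per_nbhd)
```

Because the rows are sorted before `head(2)` is applied per group, the rows kept are the most frequent species for each neighborhood.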

Recommended Trees¶

In [29]:
#create overall_harm for each tree
trees['overall_harm'] = trees[to_bool].sum(axis = 1)

#create dataframe of trees grouped by species
trees_by_species = trees.groupby('spc_common').agg({'tree_dbh' : 'mean',
                                                    'spc_common' : 'count',
                                                    'health' : 'mean',
                                                    'overall_harm' : 'mean',
                                                    'on_curb' : 'mean'})

#rename the spc_common index to species
trees_by_species.index.rename('species', inplace = True)
#rename the spc_common column, which contains the count of each species, to count
trees_by_species.rename(columns = {'spc_common' : 'count'}, inplace = True)
In [30]:
#create red_pal
red_pal = sns.light_palette('red', as_cmap = True)

#We will pick trees that have count more than 15 to ensure that the trees recommended are not rare
sns.scatterplot(data = trees_by_species.query('count > 15'), x = 'tree_dbh' , y = 'health',
                hue = 'overall_harm', palette = red_pal)
plt.title('Species with count more than 15', color = 'forestgreen')
plt.show()

Observations

  • Notice that the average health of most species in this set is higher than 2.0. This means that, on average, trees of these species are in better than fair health.

Strategy

  • We want to look for trees that can stay healthy while exposed to more problems. We also want trees that are proven to survive being planted on the curb.
  • We will filter on overall_harm to keep species that can withstand harm. Then, as the deciding factor, we will pick the trees with the largest diameters.
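
The strategy above can be sketched on a toy per-species table. The species names and numbers below are invented for illustration. The thresholds on count, health, and overall_harm match the ones used in the next cell, while the diameter and on-curb filters are left out to keep the sketch short.

```python
import pandas as pd

# toy per-species summary: invented values, not the notebook's results
toy_species = pd.DataFrame(
    {
        'count': [120, 8, 300, 45],
        'health': [2.8, 2.9, 2.3, 2.7],
        'tree_dbh': [12.0, 15.0, 9.0, 8.0],
        'overall_harm': [0.5, 0.6, 0.7, 0.45],
    },
    index=pd.Index(['Species A', 'Species B', 'Species C', 'Species D'], name='species'),
)

# keep species that are common enough, healthy, and still exposed to harm,
# then rank the survivors by average trunk diameter
picked = (toy_species
          .query('(count > 15) and (health > 2.5) and (overall_harm > 0.4)')
          .nlargest(2, 'tree_dbh'))
print(list(picked.index))
```

Species B has the largest diameter but is dropped for being too rare, and Species C is dropped for poor health, so filtering first and ranking second changes which species come out on top.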
In [31]:
#import termcolor, so we can color the trees we recommend
from termcolor import colored

#filter trees_by_species to show trees that have the specs that we need, get 10 largest in terms of trunk diameter.
trees_to_plant = trees_by_species.query('(count > 15) and (health > 2.5) and (tree_dbh > 7) and (overall_harm > 0.4) and (on_curb > 0.5)').nlargest(10, 'tree_dbh')

#show recommendations
print(colored("The trees we recommend that the city should plant are:",'green'))
print(colored(', '.join(list(trees_to_plant.index)), 'green'))
The trees we recommend that the city should plant are:
American Elm, London Planetree, Siberian Elm, Ohio Buckeye, Tree Of Heaven, Mulberry, Willow Oak, Pin Oak, Black Locust, Black Walnut
In [32]:
#show recommendations with their features
trees_to_plant = trees_to_plant.round(2)
trees_to_plant['label'] = 'recommended'
trees_to_plant
Out[32]:
species            tree_dbh  count  health  overall_harm  on_curb  label
American Elm          13.90   1698    2.76          0.43     0.72  recommended
London Planetree      13.17   4122    2.54          0.48     0.83  recommended
Siberian Elm          12.06    156    2.79          0.42     0.92  recommended
Ohio Buckeye          11.96     24    2.71          0.88     0.96  recommended
Tree Of Heaven        11.45    104    2.74          0.78     0.96  recommended
Mulberry              11.00     68    2.63          0.46     0.88  recommended
Willow Oak            10.81    889    2.81          0.59     0.97  recommended
Pin Oak               10.07   4584    2.78          0.47     0.91  recommended
Black Locust           9.77    259    2.75          0.54     0.98  recommended
Black Walnut           9.64     33    2.76          0.58     1.00  recommended
In [33]:
#most common trees in the nbhds that are most dense with trees
trees_0 = trees_by_species[trees_by_species.index.isin(most_common_trees_in_most_densed_nbhd)].round(2)
trees_0['label'] = 'on most dense'

#show most prevalent trees with their features
trees_0
Out[33]:
species            tree_dbh  count  health  overall_harm  on_curb  label
American Elm          13.90   1698    2.76          0.43     0.72  on most dense
Callery Pear           8.68   7297    2.68          0.64     0.97  on most dense
Ginkgo                 8.45   5859    2.66          0.56     0.96  on most dense
Honeylocust            9.06  13176    2.82          0.70     0.95  on most dense
Japanese Zelkova       7.04   3596    2.71          0.51     0.95  on most dense
Littleleaf Linden      7.87   3333    2.48          0.55     0.94  on most dense
London Planetree      13.17   4122    2.54          0.48     0.83  on most dense
Pin Oak               10.07   4584    2.78          0.47     0.91  on most dense
Sophora                9.23   4453    2.76          0.57     0.96  on most dense
In [34]:
#show trees that appeared on both recommendation and most prevalent on most dense nbhds
print('The trees that we recommended that are most prevalent on the neighborhoods with most trees are ', ', '.join(list(set(trees_to_plant.index).intersection(set(trees_0.index)))))
The trees that we recommended that are most prevalent on the neighborhoods with most trees are  London Planetree, American Elm, Pin Oak

Comparing recommended trees to trees most prevalent in the most dense neighborhoods¶

In [35]:
#concatenate both tables to make swarmplots
joined = pd.concat([trees_to_plant, trees_0])

#initiate figure
fig, ax = plt.subplots(1, 5, figsize = (15,5))

#set title
fig.suptitle('Comparison of Recommended to Prevalent Trees', color = wood_color)

#loop through columns to make swarmplots
for i, col in enumerate(['tree_dbh', 'count', 'health', 'overall_harm', 'on_curb']):
    ax[i] = sns.swarmplot(data = joined, y = col, x = 'label', ax = ax[i], palette = ['limegreen', wood_color])
    ax[i].set_title(col)
    ax[i].set_xlabel('')
    ax[i].set_ylabel('')

#show plot
plt.show()

Observations

  • The recommended trees have a greater trunk diameter than the most prevalent trees.
  • Most of the recommended trees are not among the most prevalent species, which might be good as it increases diversity.
  • The recommended trees tend to have greater average health, with overall harm levels similar to those of the most prevalent trees.
  • Since most trees were planted on the curb, both groups have a similar proportion of trees planted on the curb.
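
As a quick numeric companion to the swarmplots, the sketch below averages trunk diameter and health within each group, using the per-species values copied from the two tables above. These are unweighted means of species averages, so they ignore how many trees each species has, and they are a rough check rather than a formal comparison.

```python
import pandas as pd

# average trunk diameter and health per species, copied from the two tables above
recommended = pd.DataFrame({
    'tree_dbh': [13.90, 13.17, 12.06, 11.96, 11.45, 11.00, 10.81, 10.07, 9.77, 9.64],
    'health': [2.76, 2.54, 2.79, 2.71, 2.74, 2.63, 2.81, 2.78, 2.75, 2.76],
})
prevalent = pd.DataFrame({
    'tree_dbh': [13.90, 8.68, 8.45, 9.06, 7.04, 7.87, 13.17, 10.07, 9.23],
    'health': [2.76, 2.68, 2.66, 2.82, 2.71, 2.48, 2.54, 2.78, 2.76],
})
recommended['label'] = 'recommended'
prevalent['label'] = 'on most dense'

# unweighted mean of the per-species averages within each group
summary = pd.concat([recommended, prevalent]).groupby('label').mean().round(2)
print(summary)
```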

Recommendations:¶

In [36]:
print(colored("The trees we recommend that the city should plant are:",'green'))
print(colored(', '.join(list(trees_to_plant.index)), color = 'green'))
The trees we recommend that the city should plant are:
American Elm, London Planetree, Siberian Elm, Ohio Buckeye, Tree Of Heaven, Mulberry, Willow Oak, Pin Oak, Black Locust, Black Walnut