This article covers a simple analysis and processing of the COVID-19 State Data Set.

Step 1

Use Pandas to load the COVID-19 State Data Set as a DataFrame.

Because the data is continually updated, the download link is for reference only. Link: https://pan.baidu.com/s/1npzHaEX5DUudB1yTTm8hyQ
Extraction code: qy8i

If needed, download the latest data here: COVID-19 State Data Set

import pandas as pd

# read_csv already returns a DataFrame, so no extra conversion is needed
df = pd.read_csv('COVID19_state.csv')
df

State Tested Infected Deaths Population Pop Density Gini ICU Beds Income GDP ... Hospitals Health Spending Pollution Med-Large Airports Temperature Urban Age 0-25 Age 26-54 Age 55+ School Closure Date
0 Alaska 9655 314 9 734002 1.2863 0.4081 119 59687 73205 ... 21 11064 6.4 1.0 26.60 66.0 0.36 0.39 0.25 03/19/20
1 Alabama 42538 4723 151 4908621 96.9221 0.4847 1533 42334 45219 ... 101 7281 8.1 1.0 62.80 59.0 0.33 0.37 0.31 03/16/20
2 Arkansas 24141 1739 38 3038999 58.4030 0.4719 732 42566 42454 ... 88 7408 7.1 0.0 60.40 56.2 0.34 0.37 0.30 03/17/20
3 Arizona 51045 4719 177 7378494 64.9550 0.4713 1559 43650 48055 ... 83 6452 9.7 1.0 60.30 89.8 0.33 0.36 0.30 03/16/20
4 California 266900 28963 1072 39937489 256.3727 0.4899 7338 62586 74205 ... 359 7549 12.8 9.0 59.40 95.0 0.33 0.40 0.26 03/19/20
5 Colorado 44606 9433 411 5845526 56.4011 0.4586 1597 56846 63882 ... 89 6804 6.7 1.0 45.10 86.2 0.33 0.40 0.27 03/23/20
6 Connecticut 58213 17550 1086 3563077 735.8689 0.4945 674 74561 76342 ... 32 9859 7.2 1.0 49.00 88.0 0.30 0.38 0.32 03/17/20
7 District of Columbia 13268 2666 91 720687 11814.5410 0.5420 314 47285 200277 ... 10 11944 9.8 0.0 54.65 100.0 0.30 0.48 0.22 03/16/20
8 Delaware 14794 2538 67 982895 504.3073 0.4522 186 51449 77253 ... 7 10254 8.3 0.0 55.30 83.3 0.30 0.37 0.33 03/16/20
9 Florida 253183 25492 748 21992985 410.1256 0.4852 5604 49417 48318 ... 217 8076 7.4 7.0 70.70 91.2 0.29 0.37 0.35 03/16/20
10 Georgia 74208 17841 677 10736059 186.6719 0.4813 2508 45745 55832 ... 145 6587 8.3 1.0 63.50 75.1 0.35 0.39 0.26 03/18/20
11 Hawaii 23215 574 9 1412687 219.9419 0.4420 201 54565 64096 ... 22 7299 5.4 2.0 70.00 91.9 0.30 0.37 0.32 03/23/20
12 Iowa 22947 2513 74 3179849 56.9284 0.4451 545 48823 59977 ... 118 8200 7.1 0.0 47.80 64.0 0.34 0.36 0.30 NaN
13 Idaho 16869 1668 44 1826156 22.0969 0.4503 314 43155 43430 ... 45 6927 6.8 0.0 44.40 70.6 0.36 0.36 0.28 03/23/20
14 Illinois 137404 29160 1259 12659682 228.0243 0.4810 3144 56933 67268 ... 187 8262 9.3 2.0 51.80 88.5 0.33 0.38 0.28 03/17/20
15 Indiana 56873 10641 545 6745354 188.2810 0.4527 1861 46646 55172 ... 132 8300 8.4 1.0 51.70 72.4 0.34 0.37 0.29 03/19/20
16 Kansas 17676 1790 86 2910357 35.5968 0.4550 767 50155 56334 ... 139 7651 7.0 0.0 54.30 74.2 0.35 0.36 0.29 03/18/20
17 Kentucky 32225 2707 144 4499692 113.9566 0.4813 1392 41779 46898 ... 105 8004 8.1 1.0 55.60 58.4 0.33 0.38 0.30 03/16/20
18 Louisiana 137999 23580 1267 4645184 107.5175 0.4990 1289 45542 53589 ... 158 7815 7.9 1.0 66.40 73.2 0.34 0.37 0.28 03/16/20
19 Massachusetts 156806 36372 1560 6976597 894.4355 0.4786 1326 70073 82480 ... 75 10559 6.3 1.0 47.90 92.0 0.30 0.39 0.31 03/17/20
20 Maryland 65370 12308 463 6083116 626.6731 0.4499 1134 62914 68573 ... 50 8602 7.7 1.0 54.20 87.2 0.31 0.39 0.29 03/16/20
21 Maine 4241 847 32 1345790 43.6336 0.4519 256 48241 47969 ... 34 9531 5.9 0.0 41.00 38.7 0.26 0.37 0.37 NaN
22 Michigan 107791 30791 2308 10045029 177.6655 0.4695 2423 47582 53209 ... 144 8055 8.0 1.0 44.40 74.6 0.32 0.37 0.31 03/16/20
23 Minnesota 44368 2213 121 5700671 71.5922 0.4496 1171 56374 64675 ... 127 8871 6.6 1.0 41.20 73.3 0.32 0.38 0.30 03/18/20
24 Missouri 53525 5517 175 6169270 89.7453 0.4646 1888 46635 51699 ... 122 8107 7.5 2.0 54.60 70.4 0.33 0.37 0.31 03/19/20
25 Mississippi 37733 3974 152 2989260 63.7056 0.4828 824 37994 37948 ... 99 7646 7.7 0.0 63.40 49.4 0.35 0.36 0.29 03/20/20
26 Montana 10569 426 10 1086759 7.4668 0.4667 165 47120 46609 ... 56 8221 6.6 0.0 42.70 55.9 0.31 0.35 0.34 03/16/20
27 North Carolina 76211 6140 164 10611862 218.2702 0.4780 2227 45834 54441 ... 112 7264 7.2 2.0 59.00 66.1 0.32 0.38 0.29 03/16/20
28 North Dakota 12963 528 9 761723 11.0393 0.4533 238 54306 72597 ... 39 9851 4.6 0.0 40.40 59.9 0.35 0.37 0.28 03/16/20
29 Nebraska 13753 1138 24 1952570 25.4161 0.4477 440 52110 63942 ... 93 8412 7.1 1.0 48.80 73.1 0.35 0.37 0.29 NaN
30 New Hampshire 13424 1342 38 1371246 153.1605 0.4304 242 61405 63067 ... 28 9589 4.4 0.0 43.80 60.3 0.28 0.37 0.34 03/16/20
31 New Jersey 162536 81420 4070 8936574 1215.1991 0.4813 1822 67609 69378 ... 82 8859 8.1 1.0 52.70 94.7 0.31 0.38 0.30 03/18/20
32 New Mexico 36632 1798 51 2096640 17.2850 0.4769 340 41198 46954 ... 41 7214 6.0 1.0 53.40 77.4 0.33 0.36 0.31 03/16/20
33 Nevada 30751 3626 155 3139658 28.5993 0.4577 900 48225 55269 ... 44 6714 9.0 1.0 49.90 94.2 0.32 0.40 0.29 03/16/20
34 New York 596532 236732 12192 19440469 412.5211 0.5229 3952 68667 85746 ... 166 9778 6.6 3.0 45.40 87.9 0.31 0.39 0.30 03/18/20
35 Ohio 83131 10222 451 11747694 287.5038 0.4680 3314 48242 57492 ... 194 8712 8.5 3.0 50.70 77.9 0.32 0.37 0.31 03/17/20
36 Oklahoma 35561 2465 136 3954821 57.6547 0.4645 1064 46128 50613 ... 125 7627 8.2 0.0 59.60 66.2 0.35 0.37 0.29 03/17/20
37 Oregon 37583 1844 72 4301089 44.8086 0.4583 659 49908 56956 ... 61 8044 7.8 1.0 48.40 81.0 0.30 0.39 0.31 03/16/20
38 Pennsylvania 153965 31069 836 12820878 286.5449 0.4689 3169 55349 61594 ... 199 9258 9.2 2.0 48.80 78.7 0.30 0.37 0.32 03/16/20
39 Rhode Island 10933 1118 60 1056161 1021.4323 0.4781 279 54523 57852 ... 11 9551 7.3 0.0 50.10 90.7 0.29 0.39 0.32 03/16/20
40 South Carolina 32826 4491 137 5210095 173.3174 0.4735 1225 42736 45280 ... 69 7311 7.4 0.0 62.40 66.3 0.32 0.36 0.32 03/16/20
41 South Dakota 38833 4246 119 903027 11.9116 0.4495 152 50141 58624 ... 57 8933 5.1 0.0 45.20 56.7 0.35 0.35 0.30 03/16/20
42 Tennessee 11661 1542 7 6897576 167.2748 0.4790 2209 47179 53933 ... 115 7372 7.4 1.0 57.60 66.4 0.33 0.38 0.29 03/20/20
43 Texas 90586 6762 145 29472295 112.8204 0.4800 6199 49161 61167 ... 523 6998 8.3 6.0 64.80 84.7 0.36 0.39 0.24 03/23/20
44 Utah 176239 18260 453 3282115 39.9430 0.4063 565 45340 55550 ... 54 5982 8.4 1.0 48.60 90.6 0.42 0.37 0.21 03/16/20
45 Virginia 59944 2931 25 8626207 218.4403 0.4705 1654 56952 62563 ... 96 7556 6.9 2.0 55.10 75.5 0.33 0.38 0.29 03/16/20
46 Vermont 51931 8053 258 628061 68.1416 0.4539 94 53598 53523 ... 14 10190 5.1 0.0 42.90 38.9 0.27 0.36 0.36 03/18/20
47 Washington 12116 779 35 7797095 117.3272 0.4591 1265 60781 74182 ... 92 7913 8.0 1.0 48.30 84.1 0.31 0.40 0.29 03/17/20
48 Wisconsin 131984 11802 624 5851754 108.0497 0.4498 1159 50756 57720 ... 133 8702 6.8 1.0 43.10 70.2 0.32 0.37 0.31 03/18/20
49 West Virginia 48161 4199 211 1778070 73.9691 0.4711 653 40578 43053 ... 56 9462 7.6 0.0 51.80 48.7 0.29 0.36 0.35 03/16/20
50 Wyoming 19794 825 18 567025 5.8400 0.4360 102 60095 69900 ... 29 8320 5.0 0.0 42.00 64.8 0.32 0.36 0.31 03/20/20

51 rows × 26 columns

df.keys()
Index(['State', 'Tested', 'Infected', 'Deaths', 'Population', 'Pop Density',
       'Gini', 'ICU Beds', 'Income', 'GDP', 'Unemployment', 'Sex Ratio',
       'Smoking Rate', 'Flu Deaths', 'Respiratory Deaths', 'Physicians',
       'Hospitals', 'Health Spending', 'Pollution', 'Med-Large Airports',
       'Temperature', 'Urban', 'Age 0-25', 'Age 26-54', 'Age 55+',
       'School Closure Date'],
      dtype='object')
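
After loading, it can help to confirm the shape and check for missing values before going further. The sketch below is only an illustration: it uses a small hand-built DataFrame with three rows copied from the output above instead of the CSV file, so it runs on its own. The name `sample_df` is hypothetical, and only a few of the 26 columns are included.

```python
import numpy as np
import pandas as pd

# Three rows copied from the output above (subset of columns), standing in for the CSV
sample_df = pd.DataFrame({
    'State': ['Alaska', 'Iowa', 'Maine'],
    'Tested': [9655, 22947, 4241],
    'Infected': [314, 2513, 847],
    'Deaths': [9, 74, 32],
    'School Closure Date': ['03/19/20', np.nan, np.nan],
})

print(sample_df.shape)            # (rows, columns)
print(sample_df.dtypes)           # data type of each column
missing = sample_df.isna().sum()  # number of missing values per column
print(missing)
```

On the full dataset the same calls would be made on `df`; the NaN entries in School Closure Date visible in the table above would show up in the missing-value count.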

Step 2

Randomly sample 20 rows and display them.

# n=20 requests exactly 20 rows; frac=0.4 only gives 20 here because round(0.4 * 51) == 20
df1 = df.sample(n=20)
df1

State Tested Infected Deaths Population Pop Density Gini ICU Beds Income GDP ... Hospitals Health Spending Pollution Med-Large Airports Temperature Urban Age 0-25 Age 26-54 Age 55+ School Closure Date
26 Montana 10569 426 10 1086759 7.4668 0.4667 165 47120 46609 ... 56 8221 6.6 0.0 42.7 55.9 0.31 0.35 0.34 03/16/20
22 Michigan 107791 30791 2308 10045029 177.6655 0.4695 2423 47582 53209 ... 144 8055 8.0 1.0 44.4 74.6 0.32 0.37 0.31 03/16/20
15 Indiana 56873 10641 545 6745354 188.2810 0.4527 1861 46646 55172 ... 132 8300 8.4 1.0 51.7 72.4 0.34 0.37 0.29 03/19/20
11 Hawaii 23215 574 9 1412687 219.9419 0.4420 201 54565 64096 ... 22 7299 5.4 2.0 70.0 91.9 0.30 0.37 0.32 03/23/20
14 Illinois 137404 29160 1259 12659682 228.0243 0.4810 3144 56933 67268 ... 187 8262 9.3 2.0 51.8 88.5 0.33 0.38 0.28 03/17/20
6 Connecticut 58213 17550 1086 3563077 735.8689 0.4945 674 74561 76342 ... 32 9859 7.2 1.0 49.0 88.0 0.30 0.38 0.32 03/17/20
41 South Dakota 38833 4246 119 903027 11.9116 0.4495 152 50141 58624 ... 57 8933 5.1 0.0 45.2 56.7 0.35 0.35 0.30 03/16/20
49 West Virginia 48161 4199 211 1778070 73.9691 0.4711 653 40578 43053 ... 56 9462 7.6 0.0 51.8 48.7 0.29 0.36 0.35 03/16/20
28 North Dakota 12963 528 9 761723 11.0393 0.4533 238 54306 72597 ... 39 9851 4.6 0.0 40.4 59.9 0.35 0.37 0.28 03/16/20
25 Mississippi 37733 3974 152 2989260 63.7056 0.4828 824 37994 37948 ... 99 7646 7.7 0.0 63.4 49.4 0.35 0.36 0.29 03/20/20
42 Tennessee 11661 1542 7 6897576 167.2748 0.4790 2209 47179 53933 ... 115 7372 7.4 1.0 57.6 66.4 0.33 0.38 0.29 03/20/20
30 New Hampshire 13424 1342 38 1371246 153.1605 0.4304 242 61405 63067 ... 28 9589 4.4 0.0 43.8 60.3 0.28 0.37 0.34 03/16/20
29 Nebraska 13753 1138 24 1952570 25.4161 0.4477 440 52110 63942 ... 93 8412 7.1 1.0 48.8 73.1 0.35 0.37 0.29 NaN
35 Ohio 83131 10222 451 11747694 287.5038 0.4680 3314 48242 57492 ... 194 8712 8.5 3.0 50.7 77.9 0.32 0.37 0.31 03/17/20
3 Arizona 51045 4719 177 7378494 64.9550 0.4713 1559 43650 48055 ... 83 6452 9.7 1.0 60.3 89.8 0.33 0.36 0.30 03/16/20
12 Iowa 22947 2513 74 3179849 56.9284 0.4451 545 48823 59977 ... 118 8200 7.1 0.0 47.8 64.0 0.34 0.36 0.30 NaN
24 Missouri 53525 5517 175 6169270 89.7453 0.4646 1888 46635 51699 ... 122 8107 7.5 2.0 54.6 70.4 0.33 0.37 0.31 03/19/20
31 New Jersey 162536 81420 4070 8936574 1215.1991 0.4813 1822 67609 69378 ... 82 8859 8.1 1.0 52.7 94.7 0.31 0.38 0.30 03/18/20
20 Maryland 65370 12308 463 6083116 626.6731 0.4499 1134 62914 68573 ... 50 8602 7.7 1.0 54.2 87.2 0.31 0.39 0.29 03/16/20
4 California 266900 28963 1072 39937489 256.3727 0.4899 7338 62586 74205 ... 359 7549 12.8 9.0 59.4 95.0 0.33 0.40 0.26 03/19/20

20 rows × 26 columns
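
`sample` accepts either `n` (an exact row count) or `frac` (a fraction of the rows, rounded to a whole number), and each call returns a different random selection unless `random_state` is fixed. The sketch below illustrates both points on a small hypothetical DataFrame named `demo`, built from five rows of the table above; the seed value 42 is arbitrary.

```python
import pandas as pd

# Five rows copied from the table above (subset of columns)
demo = pd.DataFrame({
    'State': ['Alaska', 'Alabama', 'Arkansas', 'Arizona', 'California'],
    'Deaths': [9, 151, 38, 177, 1072],
})

# Fixing random_state makes the draw repeatable
s1 = demo.sample(n=3, random_state=42)
s2 = demo.sample(n=3, random_state=42)
print(s1)

# frac is converted to a row count: round(0.4 * 5) == 2
two_rows = demo.sample(frac=0.4)
print(len(two_rows))
```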

Step 3

Randomly sample 10 rows from the states with more than 100 deaths and display them.

# Keep only states with more than 100 deaths, then draw 10 of them at random
df2 = df[df['Deaths'] > 100]
df3 = df2.sample(n=10)
df3
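
One thing to watch for with filter-then-sample: `sample` raises a ValueError when `n` is larger than the number of rows left after filtering (sampling without replacement). A minimal sketch of one way to guard against that, using a hypothetical six-row `demo` frame copied from the table above:

```python
import pandas as pd

# Six rows copied from the table above (subset of columns)
demo = pd.DataFrame({
    'State': ['Alaska', 'Alabama', 'Arkansas', 'Arizona', 'California', 'Colorado'],
    'Deaths': [9, 151, 38, 177, 1072, 411],
})

over_100 = demo[demo['Deaths'] > 100]

# Cap n at the number of matching rows so sample() cannot fail
n_wanted = 10
picked = over_100.sample(n=min(n_wanted, len(over_100)))
print(picked)
```

With the full 51-row dataset far more than 10 states pass the filter, so the cap only matters for smaller inputs.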

Step 4

Sort the data by GDP and show the top 20 rows.

# Sort in descending order so the highest-GDP states come first
df4 = df.sort_values(by='GDP', ascending=False)
df4.head(20)
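
"Top" rows by GDP means the largest values first, which requires a descending sort because `sort_values` sorts in ascending order by default. As an alternative sketch, `nlargest` does the sort and the cut in one call. The `demo` frame below is hypothetical and uses five rows copied from the table above.

```python
import pandas as pd

# Five rows copied from the table above (subset of columns)
demo = pd.DataFrame({
    'State': ['Alaska', 'Alabama', 'Arkansas', 'Arizona', 'California'],
    'GDP': [73205, 45219, 42454, 48055, 74205],
})

# Two equivalent ways to get the three highest-GDP rows
top3_sorted = demo.sort_values(by='GDP', ascending=False).head(3)
top3_nlargest = demo.nlargest(3, 'GDP')
print(top3_nlargest)
```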

Step 5

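Show simple summary statistics (mean, std, min, max, and the first, second, and third quartiles).

# numeric_only=True skips text columns such as State and School Closure Date

# 1. mean
print(df.mean(numeric_only=True))          # column means
print(df.mean(axis=1, numeric_only=True))  # row means

# 2. std
print(df.std(numeric_only=True))           # column standard deviations
print(df.std(axis=1, numeric_only=True))   # row standard deviations

# 3. min
print(df.min(numeric_only=True))

# 4. max
print(df.max(numeric_only=True))

# 5. first quartile (quantile() defaults to q=0.5, so pass q=0.25 explicitly)
print(df.quantile(q=0.25, numeric_only=True))

# 6. second quartile (median)
print(df.quantile(q=0.5, numeric_only=True))

# 7. third quartile
print(df.quantile(q=0.75, numeric_only=True))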
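
All seven statistics can also be obtained in a single call with `describe`, which by default summarizes only the numeric columns of a mixed DataFrame. A minimal sketch on a hypothetical `demo` frame with five rows copied from the table above:

```python
import pandas as pd

# Five rows copied from the table above (subset of columns)
demo = pd.DataFrame({
    'State': ['Alaska', 'Alabama', 'Arkansas', 'Arizona', 'California'],
    'Deaths': [9, 151, 38, 177, 1072],
    'GDP': [73205, 45219, 42454, 48055, 74205],
})

# Rows of the result: count, mean, std, min, 25%, 50%, 75%, max
summary = demo.describe()
print(summary)

# Individual values can be looked up by statistic and column
print(summary.loc['25%', 'Deaths'])
```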

Step 6

Use Matplotlib to draw 2D plots of the data.

Plot the distribution of the two classes of COVID-19 State Data (1. GDP < 58000, 2. GDP >= 58000) using a different color and marker for each class, with Pollution on the x-axis and Mortality-rate on the y-axis.

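import matplotlib.pyplot as plt

# Mortality rate = Deaths / Infected, the same ratio used for the pie chart in Step 7
df_plot = df.assign(Mortality=df['Deaths'] / df['Infected'])
df_GDP1 = df_plot[df_plot['GDP'] < 58000]
df_GDP2 = df_plot[df_plot['GDP'] >= 58000]

# One scatter call per class so each gets its own color and marker
plt.scatter(df_GDP1['Pollution'], df_GDP1['Mortality'], c='tab:blue', marker='o', label='GDP < 58000')
plt.scatter(df_GDP2['Pollution'], df_GDP2['Mortality'], c='tab:red', marker='^', label='GDP >= 58000')
plt.xlabel('Pollution')
plt.ylabel('Mortality-rate')
# Add a title to the chart
plt.title('Mortality-rate vs. Pollution by GDP class')
plt.legend()
plt.show()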
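
Mortality-rate is not a column in the dataset; it has to be derived, here as Deaths / Infected (the ratio used in Step 7). The sketch below shows one way to add that column and a GDP class label before plotting, so the two classes can be handled in a loop. The `demo` frame, the `Mortality` column, and the `GDP class` column are hypothetical names; the rows are copied from the table above.

```python
import numpy as np
import pandas as pd

# Five rows copied from the table above (subset of columns)
demo = pd.DataFrame({
    'State': ['Alaska', 'Alabama', 'Arkansas', 'Arizona', 'California'],
    'Infected': [314, 4723, 1739, 4719, 28963],
    'Deaths': [9, 151, 38, 177, 1072],
    'GDP': [73205, 45219, 42454, 48055, 74205],
    'Pollution': [6.4, 8.1, 7.1, 9.7, 12.8],
})

# Derived columns: mortality rate and a two-way GDP class label
demo['Mortality'] = demo['Deaths'] / demo['Infected']
demo['GDP class'] = np.where(demo['GDP'] < 58000, 'GDP < 58000', 'GDP >= 58000')

# Each group holds the (Pollution, Mortality) points for one class
for name, group in demo.groupby('GDP class'):
    print(name, len(group))

class_sizes = demo['GDP class'].value_counts()
```

Inside the loop, each `group` could be passed to a separate `plt.scatter` call with its own color and marker.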


Step 7

Show the proportions of three classes of COVID-19 State Data using a pie chart.

The classes are:

1. Mortality-rate < 0.02
2. Mortality-rate between 0.02 and 0.03
3. Mortality-rate > 0.03

# Mortality rate = Deaths / Infected
rate = df['Deaths'] / df['Infected']

# Count the states in each class
sizes = [
    (rate < 0.02).sum(),                      # mortality rate below 0.02
    ((rate >= 0.02) & (rate <= 0.03)).sum(),  # mortality rate between 0.02 and 0.03
    (rate > 0.03).sum(),                      # mortality rate above 0.03
]

labels = ['Mortality-rate < 0.02', 'Mortality-rate between 0.02 and 0.03', 'Mortality-rate > 0.03']
plt.pie(sizes, labels=labels, autopct='%1.1f%%', shadow=False, startangle=150)
plt.axis('equal')  # equal aspect ratio so the pie is drawn as a circle
plt.show()
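
The class counts that feed the pie chart can be checked without plotting by labelling each state with its class and counting the labels. This is a sketch on a hypothetical `demo` frame with five rows copied from the table above; the label strings are arbitrary.

```python
import pandas as pd

# Five rows copied from the table above (subset of columns)
demo = pd.DataFrame({
    'State': ['Alaska', 'Alabama', 'Arkansas', 'Arizona', 'Hawaii'],
    'Infected': [314, 4723, 1739, 4719, 574],
    'Deaths': [9, 151, 38, 177, 9],
})

rate = demo['Deaths'] / demo['Infected']

# Start with the highest class, then overwrite the lower ones
classes = pd.Series('> 0.03', index=rate.index)
classes[rate <= 0.03] = '0.02 to 0.03'
classes[rate < 0.02] = '< 0.02'

counts = classes.value_counts()
print(counts)
```

The three counts sum to the number of rows, so every state lands in exactly one slice of the pie.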

