其他
Pandas & SQL 语法归纳总结,真的太全了
The following article is from 关于数据分析与可视化 Author 俊欣
Pandas
与SQL
可能是大家用的比较多的两个工具,两者都可以对数据集进行深度的分析,挖掘出有价值的信息,但是二者的语法有着诸多的不同,今天小编就来总结归纳一下Pandas
与SQL
这两者之间在语法上到底有哪些不同。导入数据
Pandas
而言,我们需要提前导入数据集,然后再进行进一步的分析与挖掘airports = pd.read_csv('data/airports.csv')
airport_freq = pd.read_csv('data/airport-frequencies.csv')
runways = pd.read_csv('data/runways.csv')
基础语法
SQL
当中,我们用SELECT
来查找数据,WHERE
来过滤数据,DISTINCT
来去重,LIMIT
来限制输出结果的数量,select * from airports
## Pandas
airports
select * from airports limit 3
## Pandas
airports.head(3)
select id from airports where ident = 'KLAX'
## Pandas
airports[airports.ident == 'KLAX'].id
select distinct type from airport
## Pandas
airports.type.unique()
多个条件交集来筛选数据
select * from airports
where iso_region = 'US-CA' and
type = 'seaplane_base'
## Pandas
airports[(airports.iso_region == 'US-CA') &
(airports.type == 'seaplane_base')]
select ident, name, municipality from airports
where iso_region = 'US-CA' and
type = 'large_airport'
## Pandas
airports[(airports.iso_region == 'US-CA') &
(airports.type == 'large_airport')][['ident', 'name', 'municipality']]
排序
Pandas
当中默认是对数据进行升序排序,要是我们希望对数据进行降序排序,需要设定ascending
参数select * from airport_freq
where airport_ident = 'KLAX'
order by type
## Pandas
airport_freq[airport_freq.airport_ident == 'KLAX']
.sort_values('type')
select * from airport_freq
where airport_ident = 'KLAX'
order by type desc
## Pandas
airport_freq[airport_freq.airport_ident == 'KLAX']
.sort_values('type', ascending=False)
筛选出列表当中的数据
isin()
方法,代码如下select * from airports
where type in ('heliport', 'balloonport')
## Pandas
airports[airports.type.isin(['heliport', 'balloonport'])]
select * from airports
where type not in ('heliport', 'balloonport')
## Pandas
airports[~airports.type.isin(['heliport', 'balloonport'])]
删除数据
Pandas
当中删除数据用的是drop()
方法,代码如下delete from dataframe where col_name = 'MISC'
## Pandas
df = df[df.type != 'MISC']
df.drop(df[df.type == 'MISC'].index)
更新数据
SQL
当中更新数据使用的是update
和set
方法,代码如下update airports set home_link = '......'
where ident == 'KLAX'
### Pandas
airports.loc[airports['ident'] == 'KLAX', 'home_link'] = '......'
调用统计函数
min()
、max()
、mean()
以及median()
函数作用于length_ft
这一列上面,代码如下select max(length_ft), min(length_ft),
avg(length_ft), median(length_ft) from runways
## Pandas
runways.agg({'length_ft': ['min', 'max', 'mean', 'median']})
合并两表格
Pandas
当中合并表格用的是pd.concat()
方法,在SQL
当中则是UNION ALL
,代码如下select name, municipality from airports
where ident = 'KLAX'
union all
select name, municipality from airports
where ident = 'KLGB'
## Pandas
pd.concat([airports[airports.ident == 'KLAX'][['name', 'municipality']],
airports[airports.ident == 'KLGB'][['name', 'municipality']]])
分组
groupby()
方法,代码如下select iso_country, type, count(*) from airports
group by iso_country, type
order by iso_country, type
## Pandas
airports.groupby(['iso_country', 'type']).size()
分组之后再做筛选
Pandas
当中是在进行了groupby()
之后调用filter()
方法,而在SQL
当中则是调用HAVING
方法,代码如下select type, count(*) from airports
where iso_country = 'US'
group by type
having count(*) > 1000
order by count(*) desc
## Pandas
airports[airports.iso_country == 'US']
.groupby('type')
.filter(lambda g: len(g) > 1000)
.groupby('type')
.size()
.sort_values(ascending=False)
TOP N records
select 列名 from 表名
order by size
desc limit 10
## Pandas
表名.nlargest(10, columns='列名')
技术技术
分享
点收藏
点点赞
点在看