Rate Department based on Max Salary using Pyspark

Rate Department Based on Maximum Salary :

import pyspark.sql.functions as f

from pyspark.sql.functions import*

from pyspark.sql.types import *

from pyspark.sql.window import Window

df = spark.read.format("csv").option("header","true").load("file:///Users/furqan/data/emp.txt")

df.show()

+-----+------+-----------+-------+----------+-------+------+

|empno| ename|designation|manager| hire_date|    sal|deptno|

+-----+------+-----------+-------+----------+-------+------+

| 7369| SMITH|      CLERK|   7902|1980-12-17| 800.00|    20|

| 7499| ALLEN|   SALESMAN|   7698|1981-02-20|1600.00|    30|

| 7521|  WARD|   SALESMAN|   7698|1981-02-22|1250.00|    30|

| 7566| JONES|    MANAGER|   7839|1981-04-02|2975.00|    20|

| 7654|MARTIN|   SALESMAN|   7698|1981-09-28|1250.00|    30|

| 7698| BLAKE|    MANAGER|   7839|1981-05-01|2850.00|    30|

| 7782| CLARK|    MANAGER|   7839|1981-06-09|2450.00|    10|

| 7788| SCOTT|    ANALYST|   7566|1982-12-09|3000.00|    20|

| 7839|  KING|  PRESIDENT|   NULL|1981-11-17|5000.00|    10|

| 7844|TURNER|   SALESMAN|   7698|1981-09-08|1500.00|    30|

| 7876| ADAMS|      CLERK|   7788|1983-01-12|1100.00|    20|

| 7900| JAMES|      CLERK|   7698|1981-12-03| 950.00|    30|

| 7902|  FORD|    ANALYST|   7566|1981-12-03|3000.00|    20|

| 7934|MILLER|      CLERK|   7782|1982-01-23|1300.00|    10|

+-----+------+-----------+-------+----------+-------+------+

df= df.withColumn("deptno",df["deptno"].cast(IntegerType()))
df= df.withColumn("sal",df["sal"].cast(IntegerType()))

df_sum = df.groupBy("deptno").agg(f.sum("sal").alias("sum"))

df_sum.show()

+------+-------+                                            

|deptno|    sum|

+------+-------+

|    20|10875.0|

|    10| 8750.0|

|    30| 9400.0|

+------+-------+

df_sum.select("deptno","sum",row_number().over(Window.orderBy(df_sum["sum"].desc())).alias("rn")).filter("rn==1").drop("rn").show()

+------+-------+                                      

|deptno|    sum|

+------+-------+

|    20|10875.0|

+------+-------+

---------------------using Department Name-------------------

df1 = spark.read.format("csv").option("header","true").load("file:///Users/furqan/data/dept.txt")

df1.show()
+------+----------+--------+
|deptno|     dname|     loc|
+------+----------+--------+
|    10|ACCOUNTING|NEW YORK|
|    20|  RESEARCH|  DALLAS|
|    30|     SALES| CHICAGO|
|    40|OPERATIONS|  BOSTON|
+------+----------+--------+

sum = df.join(df1,df['deptno']==df1['deptno'],'inner').groupBy(df1["dname"]).agg(f.sum("sal").alias("sum"))

+----------+-------+
|     dname|    sum|
+----------+-------+
|     SALES| 9400.0|
|  RESEARCH|10875.0|
|ACCOUNTING| 8750.0|
+----------+-------+

Data Engineering

Rate Department based on Max Salary using Pyspark

Rate Department Based on Maximum Salary :

No comments:

Post a Comment

Popular

Tags

Pages