Data-Mining-51Job

Data-mining on 51Job website

This project is maintained by wangjksjtu

Data-Mining-51Job

This repository is established to explore the data on 51Job website, where a number of companies poster their wanted positions, and at the same time employees could share their own profiles to boost their career development. Overall, the work in this repo could be summarized in following aspects:

The documents of our work are available here: [report], [notebook].

Requirements

QuickStart

Web Crawling

We use scrapy to crawl raw data from 51Job website. See the directory /job51spider for codes. XPath is used to parse the html and extract data information.

After entering the directory, input the command in cmd.exe to run the spider.

scrapy crawl 51job

Data Preprocessing

We use python libraries pandas (using class dataframe) and re to preprocess the raw data. See /preprocess/preprocess.py for code. You can find the preprocessed data in /data, where middleData.csv is the preprocessed data suitable for drawing pics, while quantityData.csv quantifies all data and fits further data analysis.

Feature Engineering

See directory /pics. We analyzed feature coorelation and feature distribution respectively. We found two some main features which affect salary level: education level requirements, work experience requirements and area location.

map_salary

Salary Prediction

Model R2 value Mean Error ¥/year time / s
Ada-Boost 0.2350 37483.9 0.79
Grad-Boost 0.3237 34031.4 3.13
SVR (RBF) 0.0092 43301.9 350.08
Bayesian Ridge 0.2667 34031.4 0.05
Elastic Net 0.0426 44784.4 0.03
MLPs 0.2682 36207.3 19.29

Job Area Prediction

Model Accuracy / % time / s Model Accuracy / % time / s
LP 7.79% 3.89 MLP 20.91% 20.05
GNB 7.32% 0.23 SVM 29.31% 1032.75
KNN 25.19% 2.60 XGBoost 27.53% 303.21
RF 28.44% 1.80      

The accuracy & time plot of the above models:

Team Members