I followed along with the article linked below.
>> medium.com/@gscheithauer/process-mining-in-10-minutes-with-r-1ab28ed74e81
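1. Before We Begin
Process mining is a way of analyzing processes. Instead of relying on workshops, interviews, or old documents, it works from data extracted from business systems. You can generate a process model from that data and then use the model to uncover process compliance problems. There are many process mining tools available today; this post uses one of them, bupaR, an open-source tool made up of several R packages.
2. Introducing and Preparing the Data
This post uses real (anonymized) data from a bank's loan application process, provided by the BPI Challenge 2017 [5]. Before starting the analysis, download the Git repository below.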
>> Clone GitHub project: https://github.com/scheithauer/processmining-bupaR
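3. Analysis
It's called analysis, but really we're picking apart someone else's code!! (Open 00_pm_bupar_MAIN.R in the 01-scripts folder of the downloaded repo!)
3.1 Installing the Packages
The author marks vectors by putting a c_ prefix on their names! Also, if you run into the error namespace 'htmltools' 0.4.0 is being loaded, but >= 0.4.0.9003 is required and updating the package doesn't change anything, just.. uninstall R and install it fresh, version 4. That solved it for me.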
# check installed packages and install only necessary ones ####
c_necessary_packages <- c(
  'bupaR',
  'edeaR',
  'processmapR',
  'eventdataR',
  'readr',
  'tidyverse',
  'DiagrammeR',
  'ggplot2',
  'stringr',
  'lubridate'
)
c_missing_packages <- c_necessary_packages[!(c_necessary_packages %in% installed.packages()[,"Package"])]
if(length(c_missing_packages) > 0) install.packages(c_missing_packages)
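To see what that last filter actually does, here is a minimal sketch in base R. The package lists are made up for illustration (the `installed` vector just stands in for the Package column of `installed.packages()`), so this is not the author's code, only the same idea on toy input.

```r
# Hypothetical package lists, used only for illustration.
needed    <- c("bupaR", "edeaR", "readr")
installed <- c("readr", "stats", "utils")  # stand-in for installed.packages()[, "Package"]

# Same logic as the script: keep the needed packages that are not installed yet.
missing_a <- needed[!(needed %in% installed)]

# An equivalent formulation with setdiff().
missing_b <- setdiff(needed, installed)

print(missing_a)
```

Only the packages left in the result would be passed to install.packages(), which is why rerunning the script does not reinstall everything.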
3.2 Loading the Packages
Now that the packages are installed, we need to load them, right?
library(bupaR)
library(edeaR)
library(processmapR)
library(eventdataR)
library(readr)
library(tidyverse)
library(DiagrammeR)
library(ggplot2)
library(stringr)
library(lubridate)
3.3 Loading the Data
Now let's read in the data as well.
# load BPI Challenge 2017 data set ####
data <- readr::read_csv('./00-data/loanapplicationfile.csv',
                        locale = locale(date_names = 'en',
                                        encoding = 'ISO-8859-1'))
# remove blanks from var names (done first, so the columns used below exist under these names)
names(data) <- str_replace_all(names(data), c(" " = "_" , "," = "" ))
# change timestamp to date var
data$starttimestamp = as.POSIXct(data$Start_Timestamp,
                                 format = "%Y/%m/%d %H:%M:%S")
data$endtimestamp = as.POSIXct(data$Complete_Timestamp,
                               format = "%Y/%m/%d %H:%M:%S")
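The two preparation steps (cleaning the column names and parsing the timestamps) are easier to follow on a tiny example. Below is a minimal sketch in base R; the toy values are invented, gsub() stands in for str_replace_all(), and tz = "UTC" is my own assumption to keep the result reproducible.

```r
# Toy data with hypothetical values; the column names mimic CSV headers that contain blanks.
toy <- data.frame(
  `Case ID` = c("A1", "A1"),
  `Start Timestamp` = c("2016/01/01 09:51:15", "2016/01/01 10:05:00"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Replace blanks with underscores and drop commas from the column names.
names(toy) <- gsub(",", "", gsub(" ", "_", names(toy), fixed = TRUE), fixed = TRUE)

# Parse the text into POSIXct date-times using the same format string as the script.
toy$starttimestamp <- as.POSIXct(toy$Start_Timestamp,
                                 format = "%Y/%m/%d %H:%M:%S", tz = "UTC")

# A string that does not match the format becomes NA instead of raising an error.
bad <- as.POSIXct("01-01-2016 09:51", format = "%Y/%m/%d %H:%M:%S", tz = "UTC")
```

The NA behavior is worth knowing: if the format string does not match the file, the conversion fails silently, so a quick check such as sum(is.na(data$starttimestamp)) after loading is a cheap safeguard.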
3.4 Converting to an Event Log
To apply process mining, we convert the data into an event log.
# transform data into eventlog
events <- bupaR::activities_to_eventlog(
data,
case_id = 'Case_ID',
activity_id = 'Activity',
resource_id = 'Resource',
timestamps = c('starttimestamp', 'endtimestamp')
)
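My understanding is that the input has one row per activity with a start and an end time, while an event log has one row per event. The following is only a conceptual sketch of that reshaping in base R on invented data. It is not how bupaR implements the conversion, and the column names lifecycle and instance_id are labels I picked for illustration.

```r
# Hypothetical activity log: one row per activity instance, with start and end times.
activities <- data.frame(
  Case_ID  = c("A1", "A1", "A2"),
  Activity = c("A_Create", "A_Submitted", "A_Create"),
  Resource = c("User_1", "User_1", "User_2"),
  starttimestamp = as.POSIXct(c("2016-01-01 09:00:00", "2016-01-01 09:10:00",
                                "2016-01-02 14:00:00"), tz = "UTC"),
  endtimestamp   = as.POSIXct(c("2016-01-01 09:05:00", "2016-01-01 09:30:00",
                                "2016-01-02 14:20:00"), tz = "UTC"),
  stringsAsFactors = FALSE
)
activities$instance_id <- seq_len(nrow(activities))

# One event row per timestamp: a "start" event and a "complete" event per instance.
key_cols  <- c("Case_ID", "Activity", "Resource", "instance_id")
starts    <- cbind(activities[key_cols], lifecycle = "start",
                   timestamp = activities$starttimestamp)
completes <- cbind(activities[key_cols], lifecycle = "complete",
                   timestamp = activities$endtimestamp)

# Stack both halves and sort them chronologically within each case.
event_rows <- rbind(starts, completes)
event_rows <- event_rows[order(event_rows$Case_ID, event_rows$timestamp), ]
print(event_rows)
```

Three activity rows turn into six event rows, which is the shape the later functions work on: who (Resource) did what (Activity) for which case (Case_ID), and when.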
3.5 Exploring the Data
# statistics eventlog ####
events %>%
summary
events %>%
activity_frequency(level = "activity")
events %>%
activity_frequency(level = "activity") %>%
plot()
# filter all cases where one specific activity was present
events %>%
filter_activity_presence(activities = c('A_Cancelled')) %>%
activity_frequency(level = "activity")
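To get a feel for what these two calls compute, here is a minimal base R sketch on an invented log. It only mirrors the idea (counts per activity, and keeping whole cases that contain a given activity); the real functions operate on an eventlog object and may differ in details.

```r
# Hypothetical log: one row per executed activity.
toy_log <- data.frame(
  case     = c("A1", "A1", "A1", "A2", "A2", "A3"),
  activity = c("A_Create", "A_Submitted", "A_Cancelled",
               "A_Create", "A_Submitted", "A_Create"),
  stringsAsFactors = FALSE
)

# Absolute and relative frequency per activity, most frequent first.
counts <- sort(table(toy_log$activity), decreasing = TRUE)
freq <- data.frame(
  activity = names(counts),
  absolute = as.integer(counts),
  relative = as.numeric(counts) / nrow(toy_log)
)

# Presence filter: keep every row of the cases that contain "A_Cancelled" at least once.
cancelled_cases <- unique(toy_log$case[toy_log$activity == "A_Cancelled"])
cancelled_log   <- toy_log[toy_log$case %in% cancelled_cases, ]

print(freq)
print(cancelled_log)
```

Note that the presence filter keeps the entire case, not just the cancelled rows, so the frequencies afterwards describe everything that happened in cancelled applications.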
3.6 Process Mining
# process map ####
events %>%
filter_activity_frequency(percentage = 1.0) %>% # show only most frequent activities
filter_trace_frequency(percentage = .80) %>% # show only the most frequent traces
process_map(render = F) %>%
export_graph(file_name = './02-output/01_pm-bupar_process map.png',
file_type = 'PNG')
# process map - performance ####
events %>%
filter_activity_frequency(percentage = 1.0) %>% # show only most frequent activities
filter_trace_frequency(percentage = .80) %>% # show only the most frequent traces
process_map(performance(mean, "mins"),
render = F) %>%
export_graph(file_name = './02-output/02_pm-bupar_process map performance.png',
file_type = 'PNG')
# precedence matrix ####
precedence_matrix <- events %>%
filter_activity_frequency(percentage = 1.0) %>% # show only most frequent activities
filter_trace_frequency(percentage = .80) %>% # show only the most frequent traces
precedence_matrix() %>%
plot()
ggsave('./02-output/03_pm-bupar_process precedence matrix.png', precedence_matrix)
rm(precedence_matrix)
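As far as I can tell, a precedence matrix counts how often one activity is directly followed by another within the same case. Here is a minimal base R sketch of that counting on invented data. It leaves out any artificial start and end nodes that the bupaR version may add, and the order column is a hypothetical stand-in for the timestamps.

```r
# Hypothetical log with an explicit ordering inside each case.
toy_log <- data.frame(
  case     = c("A1", "A1", "A1", "A2", "A2"),
  activity = c("A_Create", "A_Submitted", "A_Cancelled", "A_Create", "A_Submitted"),
  order    = c(1, 2, 3, 1, 2),
  stringsAsFactors = FALSE
)
toy_log <- toy_log[order(toy_log$case, toy_log$order), ]

# Pair each row with the next row, then keep only pairs that stay inside one case.
n <- nrow(toy_log)
antecedent <- toy_log$activity[-n]
consequent <- toy_log$activity[-1]
same_case  <- toy_log$case[-n] == toy_log$case[-1]

# Cross-tabulate the directly-follows pairs.
follows <- table(antecedent = antecedent[same_case],
                 consequent = consequent[same_case])
print(follows)
```

The same pair counts are what the arrows of a frequency process map are labeled with, so the matrix and the map are two views of one relation.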
# trace explorer
trace_explorer <- events %>%
trace_explorer(coverage = 0.5)
ggsave('./02-output/04_pm-bupar_trace explorer.png', trace_explorer, width = 12)
rm(trace_explorer)
# dotted chart
chart <- events %>%
dotted_chart()
chart
# resource map ####
events %>%
filter_activity_frequency(percentage = .1) %>% # show only most frequent resources
filter_trace_frequency(percentage = .8) %>% # show only the most frequent traces
resource_map(render = F) %>%
export_graph(file_name = './02-output/05_pm-bupar_resource map.png',
file_type = 'PNG')
# resource matrix ####
resource_matrix <- events %>%
filter_activity_frequency(percentage = 1.0) %>% # show only most frequent activities
filter_trace_frequency(percentage = .80) %>% # show only the most frequent traces
resource_matrix() %>%
plot()
ggsave('./02-output/06_pm-bupar_resource matrix.png', resource_matrix)
rm(resource_matrix)
# process map where one activity was at least once present ####
events %>%
filter_activity_presence(activities = c('A_Cancelled')) %>%
filter_activity_frequency(percentage = 1.0) %>% # show only most frequent activities
filter_trace_frequency(percentage = .80) %>% # show only the most frequent traces
process_map(render = F) %>%
export_graph(file_name = './02-output/07_pm-bupar_process map cancelled.png',
file_type = 'PNG')
# process map where one activity was at least once present, for cases started between 2016-01-01 and 2016-01-02 ####
events %>%
  filter_time_period(interval = c(ymd(20160101), ymd(20160102)),
                     filter_method = 'start') %>%
filter_activity_presence(activities = c('A_Cancelled')) %>%
filter_activity_frequency(percentage = 1.0) %>% # show only most frequent activities
filter_trace_frequency(percentage = .80) %>% # show only the most frequent traces
process_map(render = F) %>%
export_graph(file_name = './02-output/08_pm-bupar_process map cancelled time intervall.png',
file_type = 'PNG')
# Conditional Process Analysis ####
events %>%
filter_activity_frequency(percentage = 1.0) %>% # show only most frequent activities
filter_trace_frequency(percentage = .80) %>% # show only the most frequent traces
throughput_time('log', units = 'hours')
events %>%
filter_activity_frequency(percentage = 1.0) %>% # show only most frequent activities
filter_trace_frequency(percentage = .80) %>% # show only the most frequent traces
throughput_time('case', units = 'hours')
events %>%
filter_activity_frequency(percentage = 1.0) %>% # show only most frequent activities
filter_trace_frequency(percentage = .80) %>% # show only the most frequent traces
group_by(`(case)_ApplicationType`) %>%
throughput_time('log', units = 'hours')
plot <- events %>%
filter_activity_frequency(percentage = 1.0) %>% # show only most frequent activities
filter_trace_frequency(percentage = .80) %>% # show only the most frequent traces
group_by(`(case)_ApplicationType`) %>%
throughput_time('log', units = 'hours') %>%
plot()
plot
ggsave('./02-output/08_pm-bupar_throughput application type.png', plot)
rm(plot)
events %>%
filter_activity_frequency(percentage = 1.0) %>% # show only most frequent activities
filter_trace_frequency(percentage = .80) %>% # show only the most frequent traces
group_by(`(case)_LoanGoal`) %>%
throughput_time('log', units = 'hours')
plot <- events %>%
filter_activity_frequency(percentage = 1.0) %>% # show only most frequent activities
filter_trace_frequency(percentage = .80) %>% # show only the most frequent traces
group_by(`(case)_LoanGoal`) %>%
throughput_time('log', units = 'hours') %>%
plot()
plot
ggsave('./02-output/09_pm-bupar_throughput loan goal.png', plot)
rm(plot)
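The throughput time calls at the end were the part I found hardest to picture, so here is a minimal base R sketch of the idea on invented data: per case, take the time from the first start to the last completion, then summarize over all cases or per group. The column names and values are hypothetical, and bupaR's own calculation may handle details differently.

```r
# Hypothetical events: two cases, each with two activities and a case-level type.
toy_events <- data.frame(
  case  = c("A1", "A1", "A2", "A2"),
  type  = c("New credit", "New credit", "Limit raise", "Limit raise"),
  start = as.POSIXct(c("2016-01-01 09:00:00", "2016-01-01 12:00:00",
                       "2016-01-02 08:00:00", "2016-01-03 08:00:00"), tz = "UTC"),
  end   = as.POSIXct(c("2016-01-01 10:00:00", "2016-01-01 15:00:00",
                       "2016-01-02 09:00:00", "2016-01-03 20:00:00"), tz = "UTC"),
  stringsAsFactors = FALSE
)

# Throughput time per case: last completion minus first start, in hours.
per_case <- do.call(rbind, lapply(split(toy_events, toy_events$case), function(d) {
  data.frame(
    case  = d$case[1],
    type  = d$type[1],
    hours = as.numeric(difftime(max(d$end), min(d$start), units = "hours"))
  )
}))

# Summary over all cases, and the mean per group (here: per type).
overall <- summary(per_case$hours)
by_type <- tapply(per_case$hours, per_case$type, mean)

print(per_case)
print(by_type)
```

Seen this way, the 'case' level corresponds to the per_case table, the 'log' level to a summary over it, and group_by() to splitting that summary by a case attribute.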
Hmm.. I followed along, but I still don't really understand what it all means. Next time I should go through the bupaR functions one by one.