I followed along with the tutorial at the link below.
>> medium.com/@gscheithauer/process-mining-in-10-minutes-with-r-1ab28ed74e81

 

1. Before we begin

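Process mining is a way of analyzing processes based on data extracted from business systems, rather than on workshops, interviews, or historical documents. You can generate a process model from that data and then use the model to uncover process compliance problems. There are many tools for process mining these days; here we use one of them, the open-source bupaR (which is made up of several R packages).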
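To make "data extracted from business systems" a bit more concrete, here is a minimal sketch in base R of what such data usually looks like: one row per event, with a case, an activity, and a timestamp. All values and column names below are made up for illustration and are not taken from the BPI Challenge data.

```r
# Toy event log: one row per event (all values are made up for illustration).
log <- data.frame(
  case_id   = c("c1", "c1", "c1", "c2", "c2"),
  activity  = c("Create", "Check", "Approve", "Create", "Cancel"),
  timestamp = as.POSIXct(c("2016-01-01 09:00:00", "2016-01-01 10:00:00",
                           "2016-01-02 08:30:00", "2016-01-01 11:00:00",
                           "2016-01-01 11:45:00"), tz = "UTC"),
  stringsAsFactors = FALSE
)

# Order events by case and time, then collapse each case into its trace,
# i.e. the sequence of activities that the case went through.
log <- log[order(log$case_id, log$timestamp), ]
traces <- tapply(log$activity, log$case_id, paste, collapse = " > ")
print(traces)
```

Each trace is one path through the process. Roughly speaking, a process model summarizes all of these paths in a single picture.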

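2. Introducing and preparing the data

This post uses real (anonymized) data from a bank's credit application process, provided by the BPI Challenge 2017 [5]. Before starting the analysis, download the Git repository below.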
>> Clone GitHub project: https://github.com/scheithauer/processmining-bupaR

 


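3. Analysis

It is called analysis, but really we are just picking apart someone else's code! (Open 00_pm_bupar_MAIN.R in the 01-scripts folder of the repo you downloaded.)

3.1 Installing packages

The author prefixes vector variable names with c_. One more thing: if you get the error namespace 'htmltools' 0.4.0 is being loaded, but >= 0.4.0.9003 is required, and updating the package does not change anything, just remove R and reinstall it, version 4. That fixed it for me.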
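Before reinstalling R, it may be worth checking which version is actually installed. The following is a minimal sketch using only base R; needs_update is a hypothetical helper name I made up, not part of any package.

```r
# Hypothetical helper: TRUE when `pkg` is not installed or is older than
# `min_version`. It only reads package metadata and does not load the package.
needs_update <- function(pkg, min_version) {
  if (!pkg %in% rownames(utils::installed.packages())) return(TRUE)
  utils::packageVersion(pkg) < package_version(min_version)
}

# The version number is copied from the error message above.
needs_update("htmltools", "0.4.0.9003")
```

If this returns TRUE, running install.packages("htmltools") in a fresh R session (with no packages loaded) is a reasonable thing to try first. Whether that is enough depends on your setup.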

# check installed packages and install only necessary ones #### 
c_necessary_packages <- c( 
  'bupaR', 
  'edeaR', 
  'processmapR', 
  'eventdataR', 
  'readr', 
  'tidyverse', 
  'DiagrammeR', 
  'ggplot2', 
  'stringr', 
  'lubridate'   
) 
c_missing_packages <- c_necessary_packages[!(c_necessary_packages %in% installed.packages()[,"Package"])] 
if(length(c_missing_packages) > 0) install.packages(c_missing_packages)

3.2 Loading packages

Now that the packages are installed, we need to load them.

library(bupaR)
library(edeaR)
library(processmapR)
library(eventdataR)
library(readr)
library(tidyverse)
library(DiagrammeR)
library(ggplot2)
library(stringr)
library(lubridate)

3.3 Loading the data

Now let's read in the data.

# load BPI Challenge 2017 data set ####
data <- readr::read_csv('./00-data/loanapplicationfile.csv',
                         locale = locale(date_names = 'en',
                                         encoding = 'ISO-8859-1'))

# remove blanks and commas from var names first, so the columns used below exist
names(data) <- str_replace_all(names(data), c(" " = "_" , "," = "" ))

# change timestamp to date var
data$starttimestamp <- as.POSIXct(data$Start_Timestamp, 
                                  format = "%Y/%m/%d %H:%M:%S")

data$endtimestamp <- as.POSIXct(data$Complete_Timestamp, 
                                format = "%Y/%m/%d %H:%M:%S")
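One thing to watch with as.POSIXct and an explicit format: a value that does not match the format becomes NA without an error. A minimal sketch with two made-up timestamp strings (tz = "UTC" is only there to make the example reproducible):

```r
# Two made-up timestamps: the first matches the format, the second uses dashes.
raw <- c("2016/01/01 09:51:15", "2016-01-01 09:51:15")
parsed <- as.POSIXct(raw, format = "%Y/%m/%d %H:%M:%S", tz = "UTC")

# Values that do not match the format come back as NA.
is.na(parsed)
sum(is.na(parsed))
```

Running sum(is.na(data$starttimestamp)) after the conversion is a cheap way to confirm that the format string fits the file.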

3.4 Converting to an event log

To apply process mining, we convert the data into an event log.

# transform data into eventlog
events <- bupaR::activities_to_eventlog(
  data,
  case_id = 'Case_ID',
  activity_id = 'Activity',
  resource_id = 'Resource',
  timestamps = c('starttimestamp', 'endtimestamp')
)

Reference: the bupaR documentation on event log conversion
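As I understand it, the conversion takes data with one row per activity (with a start and an end time) and turns it into a log with one row per event. The sketch below imitates that idea in base R on made-up data. It is not bupaR's implementation, and the column names instance_id, lifecycle, and timestamp are my own choices for illustration.

```r
# Made-up activity data: one row per activity, with start and end timestamps.
activities <- data.frame(
  Case_ID  = c("c1", "c1"),
  Activity = c("Create", "Check"),
  Resource = c("User_1", "User_2"),
  starttimestamp = as.POSIXct(c("2016-01-01 09:00:00", "2016-01-01 09:30:00"), tz = "UTC"),
  endtimestamp   = as.POSIXct(c("2016-01-01 09:10:00", "2016-01-01 10:00:00"), tz = "UTC"),
  stringsAsFactors = FALSE
)

# Give every activity row an instance id so its two events stay linked.
activities$instance_id <- seq_len(nrow(activities))
keep <- c("Case_ID", "Activity", "Resource", "instance_id")

# Build one "start" event and one "complete" event per activity row.
start_events <- data.frame(activities[keep], lifecycle = "start",
                           timestamp = activities$starttimestamp,
                           stringsAsFactors = FALSE)
complete_events <- data.frame(activities[keep], lifecycle = "complete",
                              timestamp = activities$endtimestamp,
                              stringsAsFactors = FALSE)

# Stack both sets and sort by case and time: one row per event.
events_long <- rbind(start_events, complete_events)
events_long <- events_long[order(events_long$Case_ID, events_long$timestamp), ]
print(events_long)
```

Two activity rows become four event rows, which is why the event log ends up longer than the CSV.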

3.5 Exploring the data

# statistics eventlog ####
events %>% 
  summary

events %>% 
  activity_frequency(level = "activity") 

events %>% 
  activity_frequency(level = "activity") %>% 
  plot()


# filter all cases where one specific activity was present
events %>% 
  filter_activity_presence(activities = c('A_Cancelled')) %>% 
  activity_frequency(level = "activity") 
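To see what these two steps compute, here is a rough base R equivalent on a made-up log: counting how often each activity occurs, and keeping only the cases in which a given activity appears at least once. It is a conceptual sketch, not what edeaR does internally.

```r
# Made-up log: three cases, one of which contains "A_Cancelled".
log <- data.frame(
  case_id  = c("c1", "c1", "c2", "c2", "c3"),
  activity = c("A_Create", "A_Cancelled", "A_Create", "A_Accepted", "A_Create"),
  stringsAsFactors = FALSE
)

# Absolute and relative frequency per activity, most frequent first.
absolute <- sort(table(log$activity), decreasing = TRUE)
relative <- absolute / sum(absolute)
print(absolute)
print(relative)

# Presence filter: keep every event of the cases in which "A_Cancelled"
# occurs at least once, not only the "A_Cancelled" events themselves.
cancelled_cases <- unique(log$case_id[log$activity == "A_Cancelled"])
cancelled_log   <- log[log$case_id %in% cancelled_cases, ]
print(cancelled_log)
```

The point of the presence filter is that it selects whole cases, so the frequencies computed afterwards describe everything that happened in cancelled applications.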

 

3.6 Process mining
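Most of the listing in this section relies on two ideas: how often a trace occurs, and how often one activity directly follows another (the precedence matrix, which is also what the arrows of a process map count). Here is a minimal base R sketch of both on a made-up log. As far as I can tell from the edeaR documentation, filter_trace_frequency(percentage = .80) keeps the most common traces until they cover at least 80% of the cases; treat that reading as my assumption.

```r
# Made-up log; rows are assumed to be ordered by time within each case.
log <- data.frame(
  case_id  = c("c1", "c1", "c1", "c2", "c2", "c3", "c3", "c3"),
  activity = c("Create", "Check", "Approve",
               "Create", "Cancel",
               "Create", "Check", "Approve"),
  stringsAsFactors = FALSE
)

# Trace frequency: share of cases that follow each distinct activity sequence.
traces <- tapply(log$activity, log$case_id, paste, collapse = " > ")
trace_share <- sort(table(traces), decreasing = TRUE) / length(traces)
print(trace_share)

# For each case, pair every activity with the one that directly follows it.
pairs <- do.call(rbind, lapply(split(log$activity, log$case_id), function(a) {
  data.frame(antecedent = head(a, -1), consequent = tail(a, -1),
             stringsAsFactors = FALSE)
}))

# Cross-tabulate the pairs: cell [x, y] counts how often y directly follows x.
follows <- table(pairs$antecedent, pairs$consequent)
print(follows)
```

In this toy log, "Create > Check > Approve" covers two of the three cases, so a coverage threshold below that share would drop the cancelled case from the map.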

# process map ####
events %>%
  filter_activity_frequency(percentage = 1.0) %>% # 1.0 keeps every activity; lower it to show only the most frequent ones
  filter_trace_frequency(percentage = .80) %>%    # show only the most frequent traces
  process_map(render = F) %>% 
  export_graph(file_name = './02-output/01_pm-bupar_process map.png',
               file_type = 'PNG')


# process map - performance ####
events %>%
  filter_activity_frequency(percentage = 1.0) %>% # show only most frequent activities
  filter_trace_frequency(percentage = .80) %>%    # show only the most frequent traces
  process_map(performance(mean, "mins"),
              render = F) %>% 
  export_graph(file_name = './02-output/02_pm-bupar_process map performance.png',
               file_type = 'PNG')


# precedence matrix ####
precedence_matrix <- events %>%
  filter_activity_frequency(percentage = 1.0) %>% # show only most frequent activities
  filter_trace_frequency(percentage = .80) %>%    # show only the most frequent traces
  precedence_matrix() %>% 
  plot()

ggsave('./02-output/03_pm-bupar_process precedence matrix.png', precedence_matrix)
rm(precedence_matrix)


# trace explorer
trace_explorer <- events %>%
  trace_explorer(coverage = 0.5)

ggsave('./02-output/04_pm-bupar_trace explorer.png', trace_explorer, width = 12)
rm(trace_explorer)

# dotted chart
chart <- events %>%
  dotted_chart()

chart

# resource map ####
events %>%
  filter_activity_frequency(percentage = .1) %>% # keep only the most frequent activities (this filters activities, not resources)
  filter_trace_frequency(percentage = .8) %>%    # show only the most frequent traces
  resource_map(render = F) %>% 
  export_graph(file_name = './02-output/05_pm-bupar_resource map.png',
               file_type = 'PNG')


# resource matrix ####
resource_matrix <- events %>%
  filter_activity_frequency(percentage = 1.0) %>% # show only most frequent activities
  filter_trace_frequency(percentage = .80) %>%    # show only the most frequent traces
  resource_matrix() %>% 
  plot()

ggsave('./02-output/06_pm-bupar_resource matrix.png', resource_matrix)
rm(resource_matrix)


# process map where one activity was at least once present ####
events %>%
  filter_activity_presence(activities = c('A_Cancelled')) %>% 
  filter_activity_frequency(percentage = 1.0) %>% # show only most frequent activities
  filter_trace_frequency(percentage = .80) %>%    # show only the most frequent traces
  process_map(render = F) %>% 
  export_graph(file_name = './02-output/07_pm-bupar_process map cancelled.png',
               file_type = 'PNG')


# process map where one activity was at least once present, for cases started on 1-2 Jan 2016 ####
events %>%
  filter_time_period(interval = c(ymd(20160101), ymd(20160102)),
                     filter_method = 'start') %>% 
  filter_activity_presence(activities = c('A_Cancelled')) %>% 
  filter_activity_frequency(percentage = 1.0) %>% # show only most frequent activities
  filter_trace_frequency(percentage = .80) %>%    # show only the most frequent traces
  process_map(render = F) %>% 
  export_graph(file_name = './02-output/08_pm-bupar_process map cancelled time intervall.png',
               file_type = 'PNG')


# Conditional Process Analysis ####
events %>%
  filter_activity_frequency(percentage = 1.0) %>% # show only most frequent activities
  filter_trace_frequency(percentage = .80) %>%    # show only the most frequent traces
  throughput_time('log', units = 'hours')

events %>%
  filter_activity_frequency(percentage = 1.0) %>% # show only most frequent activities
  filter_trace_frequency(percentage = .80) %>%    # show only the most frequent traces
  throughput_time('case', units = 'hours')

  
events %>%
  filter_activity_frequency(percentage = 1.0) %>% # show only most frequent activities
  filter_trace_frequency(percentage = .80) %>%    # show only the most frequent traces
  group_by(`(case)_ApplicationType`) %>% 
  throughput_time('log', units = 'hours')
  
plot <- events %>%
  filter_activity_frequency(percentage = 1.0) %>% # show only most frequent activities
  filter_trace_frequency(percentage = .80) %>%    # show only the most frequent traces
  group_by(`(case)_ApplicationType`) %>% 
  throughput_time('log', units = 'hours') %>% 
  plot()

plot

ggsave('./02-output/08_pm-bupar_throughput application type.png', plot)
rm(plot)

events %>%
  filter_activity_frequency(percentage = 1.0) %>% # show only most frequent activities
  filter_trace_frequency(percentage = .80) %>%    # show only the most frequent traces
  group_by(`(case)_LoanGoal`) %>% 
  throughput_time('log', units = 'hours')

plot <- events %>%
  filter_activity_frequency(percentage = 1.0) %>% # show only most frequent activities
  filter_trace_frequency(percentage = .80) %>%    # show only the most frequent traces
  group_by(`(case)_LoanGoal`) %>% 
  throughput_time('log', units = 'hours') %>% 
  plot()

plot

ggsave('./02-output/09_pm-bupar_throughput loan goal.png', plot)
rm(plot)
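The conditional analysis above comes down to computing a throughput time per case and then summarizing it per group. A minimal base R sketch of that idea on made-up data follows. The app_type column and its values are invented for illustration, and throughput time is taken here as the last timestamp minus the first one of a case, which is my simplified reading of what throughput_time measures.

```r
# Made-up log: three cases with a case-level attribute (app_type).
log <- data.frame(
  case_id   = c("c1", "c1", "c2", "c2", "c3", "c3"),
  app_type  = c("New", "New", "New", "New", "Limit raise", "Limit raise"),
  timestamp = as.POSIXct(c("2016-01-01 09:00:00", "2016-01-01 15:00:00",
                           "2016-01-02 08:00:00", "2016-01-02 10:00:00",
                           "2016-01-03 12:00:00", "2016-01-04 12:00:00"), tz = "UTC"),
  stringsAsFactors = FALSE
)

# Throughput time per case: last timestamp minus first timestamp, in hours.
case_hours <- sapply(split(log$timestamp, log$case_id), function(ts) {
  as.numeric(difftime(max(ts), min(ts), units = "hours"))
})
print(case_hours)

# Look up each case's attribute, then average the throughput time per group.
case_type <- tapply(log$app_type, log$case_id, function(x) x[1])
mean_by_type <- tapply(case_hours, as.character(case_type[names(case_hours)]), mean)
print(mean_by_type)
```

Comparing the groups this way is what the group_by calls in the listing set up before throughput_time is computed.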

 

Hmm. I followed along, but I still don't really understand what it all means. Next time I should go through the bupaR functions one by one.
