Figure 1. Version 2-based system experiments.


As the source of the demand that drives an enterprise's supply system, forecasting the demand for goods at upstream and downstream nodes is the basis on which all members of the supply chain make business decisions such as ordering strategy, inventory management, and distribution planning. With the rapid development of the Internet economy and the deepening integration of industry and the Internet, customers' purchasing behavior has become more strategic, which increases demand uncertainty and makes customer demand much harder to predict. At the same time, higher requirements have been placed on the agility of enterprise supply chains. Achieving a more accurate prediction of customer purchases is therefore of great practical significance for the rational formulation of supply chain operations and management strategies.


Specifically, this work uses a hybrid ARIMA-LSTM model to forecast the demand for popular commodity types on a real dataset. The results are compared experimentally with those of SARIMA and LSTM models. In this process, a detailed visual analysis of the dataset is presented, and the whole pipeline from data processing to experimental validation is described in detail. The study of supply chain uncertainty is also supplemented with a discussion, from different perspectives, of the types and sources of market factors that must be considered in supply chain demand forecasting, the problems encountered in the forecasting process, and methods for preventing them.


Figure 2. ARIMA-LSTM Model Structure.

This work builds a fusion model, based on the ARIMA and LSTM models, that performs better on time series data. The data come from the Kaggle competition Predict Future Sales, specifically the four datasets named sales_train.csv, items.csv, items_categories.csv, and shops.csv.

Data Preprocessing

First, the fields of the sample data tables are introduced to gain a preliminary understanding of the data volume and schema. Second, the data are subjected to exploratory data analysis, a method that explores the structure and regularities of the data through statistics, plotting, and tabulation. It can uncover hidden relationships in the data and reveal anomalies such as outliers and missing values.

In practical engineering cases, a time series reflects the patterns, anomalies, and trends of data changing over time. A single time series is time-sensitive, and its data format and value standards can vary greatly; in addition, missing, abnormal, and inconsistent values occur, which in turn affect the accuracy of the model and its predictions. It is therefore very important to scientifically preprocess time series data before performing time series analysis. The main tool used for data import, analysis, and processing in this section is pandas, a NumPy-based toolkit for Python. The library contains a large number of standard data models and can efficiently process and analyze large datasets; its DataFrame is a two-dimensional tabular data structure that matches the structure of the datasets in this paper. For data visualization, the Python libraries matplotlib and seaborn are mainly used.
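The exploratory pass described above can be sketched as follows. Since the real files are not reproduced here, a tiny synthetic frame with the same columns as sales_train.csv stands in for the loaded data; the real file would be read the same way with pd.read_csv("sales_train.csv").

```python
import pandas as pd

# Synthetic stand-in mirroring the columns of sales_train.csv.
sales = pd.DataFrame({
    "date":           ["02.01.2013", "03.01.2013", "05.01.2013"],
    "date_block_num": [0, 0, 0],
    "shop_id":        [59, 25, 25],
    "item_id":        [22154, 2552, 2552],
    "item_price":     [999.0, 899.0, 899.0],
    "item_cnt_day":   [1.0, 1.0, -1.0],
})

# Parse the day-first date strings into datetimes for time-series work.
sales["date"] = pd.to_datetime(sales["date"], format="%d.%m.%Y")

# Basic structure checks: dtypes, missing values, summary statistics.
print(sales.dtypes)
print(sales.isna().sum())
print(sales["item_cnt_day"].describe())
```

The `describe()` summary and the missing-value counts are the starting points for spotting the anomalies discussed in the following subsections.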

Feature Expansion and Merging

Most of the raw variables in the actual business data have no direct meaning and are not suitable for modeling as-is; after suitable transformation or combination, however, they often carry strong information value that helps both data understanding and machine learning. Observing the source data, the shop_name field in shops.csv consists of a city name followed by a store type, so a new feature city_name can be obtained by splitting the store name. Because city names are free text and unsuitable for machine learning, LabelEncoder() is used here to encode the feature as city_code. Finally, all the datasets are merged and stored as daily_data.csv, whose attributes are shown in the following table.
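A minimal sketch of this feature extraction and merge, using made-up shop names in place of the real Russian ones (the split-then-encode logic is the same):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Synthetic stand-in for shops.csv: shop_name is "<city> <store type>".
shops = pd.DataFrame({
    "shop_id":   [0, 1, 2],
    "shop_name": ["Moscow TC Mega", "Moscow TRC Atrium", "Kazan TC Park"],
})

# The city is the first token of the shop name.
shops["city_name"] = shops["shop_name"].str.split().str[0]

# Free-text city names are unsuitable for modeling; encode them as integers.
shops["city_code"] = LabelEncoder().fit_transform(shops["city_name"])

# Merge the shop-level feature back onto the sales records.
sales = pd.DataFrame({"shop_id": [0, 2, 1], "item_cnt_day": [1.0, 2.0, 1.0]})
daily_data = sales.merge(shops[["shop_id", "city_code"]], on="shop_id", how="left")
print(daily_data)
```

The same left-merge pattern joins items.csv and items_categories.csv onto the sales table before writing out daily_data.csv.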

Figure 2. daily_data.csv Property List.

Missing Value Handling

We fill the [item_cnt_day] and [item_price] columns with the mean sales volume and the mean price, respectively, of the month containing the missing date; for the other features we carry forward the value from the row above the missing one. However, because a large amount of data is missing for November 2015, that month is not used in the actual case. After the missing values are filled, the data for ID40 and ID30 are aggregated by item category ID and date, where item_price is aggregated by mean, item_cnt_day by sum, and the remaining feature values by mode.
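The fill-by-monthly-mean step and the subsequent aggregation can be sketched as below; the frame and its values are illustrative stand-ins for daily_data.csv, and only the mean/sum aggregations are shown:

```python
import pandas as pd

# Synthetic daily records with gaps, mirroring daily_data.csv.
df = pd.DataFrame({
    "date":             pd.to_datetime(["2013-01-02", "2013-01-10", "2013-01-15"]),
    "item_category_id": [40, 40, 40],
    "item_price":       [100.0, None, 140.0],
    "item_cnt_day":     [2.0, 3.0, None],
})

# Fill each gap with that month's mean price / mean sales volume.
month = df["date"].dt.to_period("M")
df["item_price"] = df["item_price"].fillna(
    df.groupby(month)["item_price"].transform("mean"))
df["item_cnt_day"] = df["item_cnt_day"].fillna(
    df.groupby(month)["item_cnt_day"].transform("mean"))

# Aggregate per category and month: mean price, total sales volume.
agg = df.groupby(["item_category_id", month]).agg(
    item_price=("item_price", "mean"),
    item_cnt=("item_cnt_day", "sum"),
)
print(agg)
```

The remaining categorical features would be aggregated with a mode function (e.g. `lambda s: s.mode().iloc[0]`) in the same `agg` call.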

Outliers Handling

Outliers are sample points whose values deviate significantly from the rest of the sample, and such points often exhibit unreasonable properties in the dataset. Ignoring them can bias the conclusions drawn in some modeling scenarios, so during data exploration it is necessary to identify these outliers and handle them properly.
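One common way to make this identification concrete is an interquartile-range rule; this is an illustrative assumption (the thresholds, column names, and values here are made up), not necessarily the exact criterion used in the paper:

```python
import pandas as pd

# Illustrative data: one absurd price and one absurd daily count.
df = pd.DataFrame({"item_price":   [100, 120, 110, 105, 30000],
                   "item_cnt_day": [1, 2, 1, 900, 2]})

def iqr_mask(s, k=1.5):
    """True where a value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)

# A row is flagged if either its price or its sales count is extreme.
outliers = iqr_mask(df["item_price"]) | iqr_mask(df["item_cnt_day"])
clean = df[~outliers]
print(clean)
```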

Figure 2. Outliers by ID40 Sales Volume and Item Price.

ARIMA-LSTM Model Design

Many time series contain both linear and nonlinear relationships. While ARIMA and SARIMA models are good at modeling the linear relationships in a time series, they are inadequate for the nonlinear ones. The LSTM model can capture both linear and nonlinear relationships, but does not perform equally well on every dataset. Therefore, to obtain the best prediction results, hybrid models that model the linear and nonlinear components of the series separately were used; such models have achieved good results in time series forecasting. Combining multiple learning algorithms yields better estimation performance than relying on a single learning algorithm. These hybrid models are based on supervised machine learning algorithms, so they can be used for both training and forecasting. In addition, hybrid models increase model diversity and can achieve better prediction results.

Even when the results of the hybrid model and those of the individual models are uncorrelated, combining them can be observed to reduce the overall variance or error, which is why hybrid models are considered among the most successful models for prediction tasks; a number of hybrids of linear and nonlinear models have been used by different researchers. In this paper, forecasting from historical time series data proceeds as follows: the linear fitting capability of ARIMA is first used to learn the linear features of the sales data for the commodity type; the error series is then computed from the predicted values of the ARIMA model; and finally the error series, together with the original series and its features, is passed to the LSTM model to obtain the final result. The nonlinear fitting ability of the LSTM is thus used to apply a small correction to the ARIMA model's predictions. The flowchart of the method is shown in Figure 5.4.
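The decomposition at the heart of this pipeline can be sketched in plain NumPy. Here a least-squares AR(p) fit stands in for the ARIMA linear stage (an intentional simplification), and the residual series it produces is exactly what the paper then feeds, together with the original series and features, into the LSTM; the LSTM stage itself is omitted from this sketch.

```python
import numpy as np

def lag_matrix(series, p):
    """Design matrix of p lagged values plus an intercept column."""
    X = np.column_stack([series[i:len(series) - p + i] for i in range(p)])
    return np.column_stack([np.ones(len(X)), X])

def fit_ar(series, p):
    """Least-squares AR(p) fit: a stand-in for ARIMA's linear stage."""
    coef, *_ = np.linalg.lstsq(lag_matrix(series, p), series[p:], rcond=None)
    return coef

# Synthetic monthly-sales-like series: trend + seasonality + noise.
rng = np.random.default_rng(0)
t = np.arange(200)
series = 10 + 0.05 * t + np.sin(t / 5) + rng.normal(0, 0.2, 200)

p = 3
coef = fit_ar(series, p)
linear_pred = lag_matrix(series, p) @ coef   # linear-stage forecast
residuals = series[p:] - linear_pred         # error series for the LSTM stage
```

In the actual model, `residuals` and the original features form the supervised training set for the LSTM, whose output corrects `linear_pred` to give the final forecast.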


Keywords: Demand Forecasting, Time Series Forecasting, Machine Learning, ARIMA, SARIMA, LSTM.



We evaluated the SARIMA model, the LSTM model, and the ARIMA-LSTM fusion model in turn, and the results show that the fusion model performs better.

Although the hybrid model proposed in this paper improves accuracy over SARIMA and LSTM, the improvement is relatively small and some limitations in demand forecasting remain, so increasing the complexity of the model for forecasting analysis can be considered in the future.

The current demand forecasting algorithm can only accomplish short-term forecasting on a small time scale. The next step will therefore be a deeper exploration using other machine learning models and an increased feature dimensionality.

Figure 2. Comparison of the three model predictions.




This paper uses data from the Kaggle competition Predict Future Sales: four datasets named sales_train.csv, items.csv, items_categories.csv, and shops.csv. sales_train.csv contains daily historical sales from January 2013 to October 2015; items.csv, items_categories.csv, and shops.csv contain additional information about the items, item categories, and shops, respectively.

Figure 2. Dataset Property List.

sales_train.csv has 6 columns and 2,935,849 rows; items.csv has 3 columns and 22,170 rows; items_categories.csv has 2 columns and 84 rows; shops.csv has 2 columns and 60 rows. These shapes show that the dataset covers 22,170 different items, grouped into 84 categories and sold across 60 shops. The table below shows the column names contained in each of the four datasets.

Figure 2. Dataset Property List.

Since the main purpose of this paper is to forecast the sales of hot item categories, it is necessary to understand the popularity of the item categories before preprocessing the data. As shown in Figure 6.2, when the sales of the different item categories are aggregated, the category with item_category_id 40 has the highest sales, followed by the category with item_category_id 30. For brevity, ID40 and ID30 will be used in the remainder of this paper to refer to the item categories for which sales forecasts are made.
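The aggregation behind this ranking is a simple groupby-sum; the sketch below uses a small synthetic frame (category IDs and counts are made up) in place of the merged Kaggle data:

```python
import pandas as pd

# Synthetic daily sales; the real frame comes from merging the Kaggle files.
sales = pd.DataFrame({
    "item_category_id": [40, 40, 30, 30, 55, 40],
    "item_cnt_day":     [3.0, 2.0, 4.0, 1.0, 1.0, 1.0],
})

# Total sales per category, largest first: ID40 leads, then ID30.
top = (sales.groupby("item_category_id")["item_cnt_day"]
            .sum()
            .sort_values(ascending=False))
print(top.head())
```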

Figure 2. Top Item Categories Sales.