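Step 2: Explore Your Data

Building and training a model is only one part of the workflow. Understanding the characteristics of your data beforehand lets you build a better model. That could simply mean obtaining higher accuracy. It could also mean needing less data for training, or fewer computational resources.

Load the Dataset

First, load the dataset into Python.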

import os
import random

import numpy as np

def load_imdb_sentiment_analysis_dataset(data_path, seed=123):
    """Loads the IMDb movie reviews sentiment analysis dataset.

    # Arguments
        data_path: string, path to the data directory.
        seed: int, seed for randomizer.

    # Returns
        A tuple of training and validation data.
        Number of training samples: 25000
        Number of test samples: 25000
        Number of categories: 2 (0 - negative, 1 - positive)

    # References
        Mass et al., http://www.aclweb.org/anthology/P11-1015

        Download and uncompress archive from:
        http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
    """
    imdb_data_path = os.path.join(data_path, 'aclImdb')

    # Load the training data
    train_texts = []
    train_labels = []
    for category in ['pos', 'neg']:
        train_path = os.path.join(imdb_data_path, 'train', category)
        for fname in sorted(os.listdir(train_path)):
            if fname.endswith('.txt'):
                with open(os.path.join(train_path, fname), encoding='utf-8') as f:
                    train_texts.append(f.read())
                train_labels.append(0 if category == 'neg' else 1)

    # Load the validation data.
    test_texts = []
    test_labels = []
    for category in ['pos', 'neg']:
        test_path = os.path.join(imdb_data_path, 'test', category)
        for fname in sorted(os.listdir(test_path)):
            if fname.endswith('.txt'):
                with open(os.path.join(test_path, fname), encoding='utf-8') as f:
                    test_texts.append(f.read())
                test_labels.append(0 if category == 'neg' else 1)

    # Shuffle the training data and labels.
    random.seed(seed)
    random.shuffle(train_texts)
    random.seed(seed)
    random.shuffle(train_labels)

    return ((train_texts, np.array(train_labels)),
            (test_texts, np.array(test_labels)))
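
The two `random.seed(seed)` calls in the loader are what keep each review paired with its label: reseeding before each shuffle makes `random.shuffle` apply the same permutation to both lists, provided they have the same length. The following is a minimal sketch of that idea using made-up toy data (the `texts` and `labels` values are illustrative and not taken from IMDb):

```python
import random

# Toy stand-ins for train_texts / train_labels (illustrative values only).
texts = ['great film', 'awful plot', 'loved it', 'fell asleep']
labels = [1, 0, 1, 0]
pairs_before = set(zip(texts, labels))

seed = 123
random.seed(seed)
random.shuffle(texts)
random.seed(seed)  # Reset the generator so the same permutation is reused.
random.shuffle(labels)

# Each text should still sit at the same index as its original label.
pairs_after = set(zip(texts, labels))
print(pairs_after == pairs_before)
```

If the second `random.seed(seed)` call were dropped, the two lists would most likely be shuffled into different orders and the labels would no longer match their reviews.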

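Check the Data

After loading the data, it is good practice to run some checks on it: pick a few samples and manually verify that they match your expectations. For example, print a few random samples to see whether the sentiment label corresponds to the sentiment of the review. Here is a review we picked at random from the IMDb dataset: "Ten minutes worth of story stretched out into the better part of two hours. When nothing of any significance had happened at the halfway point I should have left." The expected sentiment (negative) matches the sample's label.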
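
One way to run this kind of spot check is sketched below. The helper name `print_random_samples` and its parameters are hypothetical (they are not part of the original code), and the sketch assumes labels use 1 for positive and 0 for negative, as in the loader's docstring.

```python
import random

def print_random_samples(texts, labels, num_samples=3, seed=None):
    """Prints a few randomly chosen (label, text) pairs for manual review.

    Hypothetical helper for illustration only.

    # Arguments
        texts: list, sample texts.
        labels: sequence of ints, 0 (negative) or 1 (positive).
        num_samples: int, how many samples to print.
        seed: int or None, seed for reproducible picks.

    # Returns
        list of the indices that were printed.
    """
    rng = random.Random(seed)
    count = min(num_samples, len(texts))
    indices = rng.sample(range(len(texts)), count)
    for i in indices:
        sentiment = 'positive' if labels[i] == 1 else 'negative'
        # Truncate long reviews so the output stays readable.
        print('[{}] {}'.format(sentiment, texts[i][:200]))
    return indices
```

Calling `print_random_samples(train_texts, train_labels, num_samples=5)` after loading would print five reviews with their labels, which is usually enough to catch a systematic labeling mix-up.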

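Collect Key Metrics

Once you have verified the data, collect the following metrics, which help characterize your text classification problem:

Number of samples: Total number of examples in the data.
Number of classes: Total number of topics or categories in the data.
Number of samples per class: Number of samples in each class (topic/category). In a balanced dataset, all classes have a similar number of samples; in an imbalanced dataset, the number of samples per class varies widely.
Number of words per sample: Median number of words in one sample.
Frequency distribution of words: Distribution showing the frequency (number of occurrences) of each word in the dataset.
Distribution of sample length: Distribution showing the number of words per sample in the dataset.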
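
The first four metrics in this list can be computed in a few lines. The sketch below is illustrative only: `summarize_dataset` is a hypothetical helper, the toy data is made up, and words are counted by splitting on whitespace, which is an assumption about tokenization.

```python
import collections
import statistics

def summarize_dataset(texts, labels):
    """Hypothetical helper: computes summary metrics for labeled texts."""
    samples_per_class = collections.Counter(int(label) for label in labels)
    # Whitespace tokenization is an assumption made for simplicity.
    words_per_sample = [len(text.split()) for text in texts]
    return {
        'num_samples': len(texts),
        'num_classes': len(samples_per_class),
        'samples_per_class': dict(samples_per_class),
        'median_words_per_sample': statistics.median(words_per_sample),
    }

# Toy data for illustration (not from IMDb).
toy_texts = ['a fine film', 'dull', 'really quite good fun', 'bad bad']
toy_labels = [1, 0, 1, 0]
metrics = summarize_dataset(toy_texts, toy_labels)
print(metrics)
```

Comparing the per-class counts in `samples_per_class` is a quick way to tell whether a dataset is balanced.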

Here are the values of these metrics for the IMDb reviews dataset (see Figures 3 and 4 for plots of the word-frequency and sample-length distributions).

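Metric name	Metric value
Number of samples	25000
Number of classes	2
Number of samples per class	12500
Number of words per sample	174

Table 1: IMDb reviews dataset metrics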

explore_data.py contains functions to calculate and analyze these metrics. Here are a couple of examples:

import numpy as np
import matplotlib.pyplot as plt

def get_num_words_per_sample(sample_texts):
    """Returns the median number of words per sample given corpus.

    # Arguments
        sample_texts: list, sample texts.

    # Returns
        float, median number of words per sample.
    """
    num_words = [len(s.split()) for s in sample_texts]
    return np.median(num_words)

def plot_sample_length_distribution(sample_texts):
    """Plots the sample length distribution.

    # Arguments
        sample_texts: list, sample texts.
    """
    # Note: len(s) measures each sample's length in characters, not words.
    plt.hist([len(s) for s in sample_texts], 50)
    plt.xlabel('Length of a sample')
    plt.ylabel('Number of samples')
    plt.title('Sample length distribution')
    plt.show()
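
The two functions above cover the median word count and the sample-length histogram. For the word-frequency distribution shown in Figure 3, a rough sketch could look like the following. This is not necessarily how explore_data.py computes it: `get_word_frequencies` is a hypothetical helper, and the lowercased whitespace tokenization is an assumption chosen for simplicity.

```python
import collections

def get_word_frequencies(sample_texts, top_k=10):
    """Hypothetical helper: returns the top_k most frequent words.

    # Arguments
        sample_texts: list, sample texts.
        top_k: int, number of words to return.

    # Returns
        list of (word, count) tuples, most frequent first.
    """
    counts = collections.Counter()
    for text in sample_texts:
        # Lowercase and split on whitespace (a simplifying assumption).
        counts.update(text.lower().split())
    return counts.most_common(top_k)

# Toy example (made-up sentences, not IMDb data).
top_words = get_word_frequencies(['the cat', 'The dog the'], top_k=1)
print(top_words)
```

The resulting words and counts could then be passed to `plt.bar` to draw a chart similar in spirit to Figure 3.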

Figure 3: Frequency distribution of words for IMDb

Figure 4: Distribution of sample length for IMDb
