Machine Learning Glossary


This glossary defines general machine learning terms, plus terms specific to TensorFlow.

A

A/B testing

A statistical way of comparing two (or more) techniques, "A" and "B", typically an incumbent against a new rival. A/B testing aims to determine not only which technique performs better but also to understand whether the difference is statistically significant. A/B testing usually considers only two techniques using one measurement, but it can be applied to any finite number of techniques and measurements.

accuracy

The fraction of predictions that a classification model got right. In multi-class classification, accuracy is defined as follows:

$$\text{Accuracy} = \frac{\text{Correct Predictions}} {\text{Total Number Of Examples}}$$

In binary classification, accuracy has the following definition:

$$\text{Accuracy} = \frac{\text{True Positives} + \text{True Negatives}} {\text{Total Number Of Examples}}$$

See true positive and true negative. Contrast accuracy with precision and recall.
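
For illustration only (not part of the original glossary), here is a minimal NumPy sketch of the formula above; the array values are made up:

import numpy as np

def accuracy(y_true, y_pred):
    # Fraction of predictions that match the labels.
    return np.mean(np.asarray(y_true) == np.asarray(y_pred))

# 3 of the 4 predictions are correct, so accuracy is 0.75.
print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))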

action

#rl

In reinforcement learning, the mechanism by which the agent transitions between states of the environment. The agent chooses the action by using a policy.

activation function

A function (for example, ReLU or sigmoid) that takes in the weighted sum of all of the inputs from the previous layer and then generates and passes an output value (typically nonlinear) to the next layer.

active learning

A training approach in which the algorithm chooses some of the data it learns from. Active learning is particularly valuable when labeled examples are scarce or expensive to obtain. Instead of blindly seeking a diverse range of labeled examples, an active learning algorithm selectively seeks the particular range of examples it needs for learning.

AdaGrad

A sophisticated gradient descent algorithm that rescales the gradients of each parameter, effectively giving each parameter an independent learning rate. For a full explanation, see this paper.

agent

#rl

In reinforcement learning, the entity that uses a policy to maximize the expected return gained from transitioning between states of the environment.

agglomerative clustering

#clustering

See hierarchical clustering.

anomaly detection

The process of identifying outliers. For example, if the mean for a certain feature is 100 with a standard deviation of 10, then anomaly detection should flag a value of 200 as suspicious.

AR

Abbreviation for augmented reality.

area under the PR curve

See PR AUC (area under the PR curve).

area under the ROC curve

See AUC (area under the ROC curve).

artificial general intelligence

A non-human mechanism that demonstrates a broad range of problem solving, creativity, and adaptability. For example, a program demonstrating artificial general intelligence could translate text, compose symphonies, and excel at games that have not yet been invented.

artificial intelligence

A non-human program or model that can solve sophisticated tasks. For example, a program or model that translates text, or a program or model that identifies diseases from radiologic images, both exhibit artificial intelligence.

Formally, machine learning is a subfield of artificial intelligence. However, in recent years, some organizations have begun using the terms artificial intelligence and machine learning interchangeably.

attention

#language

Any of a wide range of neural network architecture mechanisms that aggregate information from a set of inputs in a data-dependent manner. A typical attention mechanism might consist of a weighted sum over a set of inputs, where the weight for each input is computed by another part of the neural network.

Refer also to self-attention and multi-head self-attention, which are the building blocks of Transformers.

attribute

#fairness

Synonym for feature. In fairness, attributes often refer to characteristics pertaining to individuals.

attribute sampling

#df

A tactic for training a decision forest in which each decision tree considers only a random subset of possible features when learning the condition. Generally, a different subset of features is sampled for each node. In contrast, when training a decision tree without attribute sampling, all possible features are considered for each node.

AUC (area under the ROC curve)

An evaluation metric that considers all possible classification thresholds.

The area under the ROC curve is the probability that a classifier will be more confident that a randomly chosen positive example is actually positive than that a randomly chosen negative example is positive.

augmented reality

#image

A technology that superimposes a computer-generated image on a user's view of the real world, thus providing a composite view.

automation bias

#fairness

When a human decision maker favors recommendations made by an automated decision-making system over information made without automation, even when the automated decision-making system makes errors.

average precision

A metric for summarizing the performance of a ranked sequence of results. Average precision is calculated by taking the average of the precision values for each relevant result (each result in the ranked list where the recall increases relative to the previous result).

See also area under the PR curve.

axis-aligned condition

#df

In a decision tree, a condition that involves only a single feature. For example, if area is a feature, then the following is an axis-aligned condition:

area > 200

Contrast with oblique condition.

B

backpropagation

The primary algorithm for performing gradient descent on neural networks. First, the output values of each node are calculated (and cached) in a forward pass. Then, the partial derivative of the error with respect to each parameter is calculated in a backward pass through the graph.

bagging

#df

A method to train an ensemble where each constituent model trains on a random subset of training examples sampled with replacement. For example, a random forest is a collection of decision trees trained with bagging.

The term bagging is short for bootstrap aggregating.

bag of words

#language

A representation of the words in a phrase or passage, irrespective of order. For example, bag of words represents the following three phrases identically:

  • the dog jumps
  • jumps the dog
  • dog jumps the

Each word is mapped to an index in a sparse vector, where the vector has an index for every word in the vocabulary. For example, the phrase the dog jumps is mapped into a feature vector with non-zero values at the three indexes corresponding to the words the, dog, and jumps. The non-zero value can be any of the following:

  • A 1 to indicate the presence of a word.
  • A count of the number of times a word appears in the bag. For example, if the phrase were the maroon dog is a dog with maroon fur, then both maroon and dog would be represented as 2, while the other words would be represented as 1.
  • Some other value, such as the logarithm of the count of the number of times a word appears in the bag.

baseline

A model used as a reference point for comparing how well another model (typically a more complex one) is performing. For example, a logistic regression model might serve as a good baseline for a deep model.

For a particular problem, the baseline helps model developers quantify the minimal expected performance that a new model must achieve for the new model to be useful.

batch

The set of examples used in one iteration (that is, one gradient update) of model training.

See also batch size.

batch normalization

Normalizing the input or output of the activation functions in a hidden layer. Batch normalization can provide benefits such as more stable training and reduced overfitting.

batch size

The number of examples in a batch. For example, the batch size of SGD is 1, while the batch size of a mini-batch is usually between 10 and 1000. Batch size is usually fixed during training and inference; however, TensorFlow does permit dynamic batch sizes.

Bayesian neural network

A probabilistic neural network that accounts for uncertainty in weights and outputs. A standard neural network regression model typically predicts a scalar value; for example, a model predicts a house price of 853,000. By contrast, a Bayesian neural network predicts a distribution of values; for example, a model predicts a house price of 853,000 with a standard deviation of 67,200. A Bayesian neural network relies on Bayes' Theorem to calculate uncertainties in weights and predictions. A Bayesian neural network can be useful when it is important to quantify uncertainty, such as in models related to pharmaceuticals. Bayesian neural networks can also help prevent overfitting.

Bayesian optimization

A probabilistic regression model technique for optimizing computationally expensive objective functions by instead optimizing a surrogate that quantifies the uncertainty using a Bayesian learning technique. Since Bayesian optimization is itself very expensive, it is usually used to optimize expensive-to-evaluate tasks that have a small number of parameters, such as selecting hyperparameters.

Bellman equation

#rl

In reinforcement learning, the optimal Q-function satisfies the following identity:

\[Q(s, a) = r(s, a) + \gamma \mathbb{E}_{s'|s,a} \max_{a'} Q(s', a')\]

Reinforcement learning algorithms apply this identity to create Q-learning via the following update rule:

\[Q(s,a) \gets Q(s,a) + \alpha \left[r(s,a) + \gamma \max_{a'} Q(s',a') - Q(s,a) \right]\]

Beyond reinforcement learning, the Bellman equation has applications to dynamic programming. See the Wikipedia entry for Bellman equation.
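
As a rough sketch (not part of the glossary) of the update rule above, assuming a small tabular Q-table indexed by discrete states and actions:

import numpy as np

def q_update(q_table, s, a, r, s_next, alpha=0.1, gamma=0.9):
    # Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]
    td_target = r + gamma * np.max(q_table[s_next])
    q_table[s, a] += alpha * (td_target - q_table[s, a])

q = np.zeros((3, 2))            # 3 toy states, 2 toy actions
q_update(q, s=0, a=1, r=1.0, s_next=2)
print(q)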

BERT (Bidirectional Encoder Representations from Transformers)

#language

A model architecture for text representation. A trained BERT model can act as part of a larger model for text classification or other ML tasks.

BERT has the following characteristics: it uses the encoder portion of the Transformer architecture (and therefore relies on self-attention), it is bidirectional, and it is pre-trained with masking on unlabeled text.

BERT variants include:

  • ALBERT, which is an acronym for A Light BERT.
  • LaBSE.

See Open Sourcing BERT: State-of-the-Art Pre-training for Natural Language Processing for an overview of BERT.

bias (ethics/fairness)

#fairness

1. Stereotyping, prejudice or favoritism towards some things, people, or groups over others. These biases can affect the collection and interpretation of data, the design of a system, and how users interact with a system. Forms of this type of bias include confirmation bias and implicit bias.

2. Systematic error introduced by a sampling or reporting procedure. Forms of this type of bias include coverage bias and selection bias.

Not to be confused with the bias term in machine learning models or prediction bias.

bias (math)

An intercept or offset from an origin. Bias (also known as the bias term) is referred to as b or w0 in machine learning models. For example, bias is the b in the following formula:

$$y' = b + w_1x_1 + w_2x_2 + \ldots + w_nx_n$$

Not to be confused with bias in ethics and fairness or prediction bias.

bigram

#seq
#language

An N-gram in which N=2.

bidirectional

#language

A term used to describe a system that evaluates the text that both precedes and follows a target section of text. In contrast, a unidirectional system only evaluates the text that precedes a target section of text.

For example, consider a masked language model that must determine probabilities for the word or words representing the underline in the following question:

What is the _____ with you?

A unidirectional language model would have to base its probabilities only on the context provided by the words "What", "is", and "the". In contrast, a bidirectional language model could also gain context from "with" and "you", which might help the model generate better predictions.

bidirectional language model

#language

A language model that determines the probability that a given token is present at a given location in an excerpt of text based on the preceding and following text.

binary classification

A type of classification task that outputs one of two mutually exclusive classes. For example, a machine learning model that evaluates email messages and outputs either "spam" or "not spam" is a binary classifier.

Contrast with multi-class classification.

binary condition

#df

A condition with only two possible outcomes, typically yes or no. For example, the following is a binary condition:

temperature >= 100

Contrast with non-binary condition.

binning

See bucketing.

BLEU (Bilingual Evaluation Understudy)

#language

A score between 0.0 and 1.0, inclusive, indicating the quality of a translation between two human languages (for example, between English and Russian). A BLEU score of 1.0 indicates a perfect translation; a BLEU score of 0.0 indicates a terrible translation.

boosting

A machine learning technique that iteratively combines a set of simple and not very accurate classifiers (referred to as "weak" classifiers) into a classifier with high accuracy (a "strong" classifier) by upweighting the examples that the model is currently misclassifying.

bounding box

#image

In an image, the (x, y) coordinates of a rectangle around an area of interest, such as the dog in the image below.

Photograph of a dog sitting on a sofa. A green bounding box with top-left coordinates of (275, 1271) and bottom-right coordinates of (2954, 2761) circumscribes the dog's body.

broadcasting

Expanding the shape of an operand in a matrix math operation to dimensions compatible for that operation. For example, linear algebra requires that the two operands in a matrix addition operation have the same dimensions. Consequently, you can't add a matrix of shape (m, n) to a vector of length n. Broadcasting enables this operation by virtually expanding the vector of length n to a matrix of shape (m, n) by replicating the same values down each column.

For example, given the following definitions, linear algebra prohibits A+B because A and B have different dimensions:

A = [[7, 10, 4],
     [13, 5, 9]]
B = [2]

However, broadcasting enables the operation A+B by virtually expanding B to:

 [[2, 2, 2],
  [2, 2, 2]]

Thus, A+B is now a valid operation:

[[7, 10, 4],  +  [[2, 2, 2],  =  [[ 9, 12, 6],
 [13, 5, 9]]      [2, 2, 2]]      [15, 7, 11]]

See the following description of broadcasting in NumPy for more details.
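
The same A + B example can be checked directly in NumPy, which performs the broadcasting automatically:

import numpy as np

A = np.array([[7, 10, 4],
              [13, 5, 9]])
B = np.array([2])     # broadcast across every cell of A

print(A + B)          # [[ 9 12  6]
                      #  [15  7 11]]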

bucketing

Converting a (usually continuous) feature into multiple binary features called buckets or bins, typically based on a value range. For example, instead of representing temperature as a single continuous floating-point feature, you could chop ranges of temperatures into discrete bins. Given temperature data sensitive to a tenth of a degree, all temperatures between 0.0 and 15.0 degrees could be put into one bin, 15.1 to 30.0 degrees could be a second bin, and 30.1 to 50.0 degrees could be a third bin.
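
A minimal sketch of the temperature example above, using NumPy to assign each reading to one of the three bins (the variable names and values are illustrative):

import numpy as np

temperatures = np.array([3.4, 14.9, 15.1, 29.7, 30.2, 49.5])
bin_edges = [15.0, 30.0]     # 0.0-15.0 -> bin 0, 15.1-30.0 -> bin 1, 30.1-50.0 -> bin 2
buckets = np.digitize(temperatures, bin_edges, right=True)
print(buckets)               # [0 0 1 1 2 2]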

C

calibration layer

A post-prediction adjustment, typically to account for prediction bias. The adjusted predictions and probabilities should match the distribution of an observed set of labels.

candidate generation

#recsystems

The initial set of recommendations chosen by a recommendation system. For example, consider a bookstore that offers 100,000 titles. The candidate generation phase creates a much smaller list of suitable books for a particular user, say 500. But even 500 books is way too many to recommend to a user. Subsequent, more expensive phases of a recommendation system (such as scoring and re-ranking) whittle down those 500 to a much smaller, more useful set of recommendations.

candidate sampling

A training-time optimization in which a probability is calculated for all the positive labels, using, for example, softmax, but only for a random sample of negative labels. For example, if we have an example labeled beagle and dog, candidate sampling computes the predicted probabilities and corresponding loss terms for the beagle and dog class outputs in addition to a random subset of the remaining classes (cat, lollipop, fence). The idea is that the negative classes can learn from less frequent negative reinforcement as long as positive classes always get proper positive reinforcement, and this is indeed observed empirically. The motivation for candidate sampling is a computational efficiency win from not computing predictions for all negatives.

categorical data

Features having a discrete set of possible values. For example, consider a categorical feature named house style, which has a discrete set of three possible values:

  • tudor
  • ranch
  • colonial

By representing house style as categorical data, the model can learn the separate impacts of tudor, ranch, and colonial on house price.

Sometimes, values in the discrete set are mutually exclusive, and only one value can be applied to a given example. For example, a car maker categorical feature would probably permit only a single value (Toyota) per example. Other times, more than one value may be applicable. A single house could be painted more than one color, so a house color categorical feature would likely permit a single example to have multiple values (for example, yellow and black).

Categorical features are sometimes called discrete features.

Contrast with numerical data.

causal language model

#language

Synonym for unidirectional language model.

See bidirectional language model to contrast different directional approaches in language modeling.

centroid

#clustering

The center of a cluster as determined by a k-means or k-median algorithm. For instance, if k is 3, then the k-means or k-median algorithm finds 3 centroids.

centroid-based clustering

#clustering

A category of clustering algorithms that organizes data into nonhierarchical clusters. k-means is the most widely used centroid-based clustering algorithm.

Contrast with hierarchical clustering algorithms.

checkpoint

Data that captures the state of a model's parameters at a particular moment in time. Checkpoints enable exporting model weights, as well as performing training across multiple sessions. Checkpoints also enable training to continue past errors (for example, job preemption).

class

One of a set of enumerated target values for a label. For example, in a binary classification model that detects spam, the two classes are spam and not spam. In a multi-class classification model that identifies dog breeds, the classes would be poodle, beagle, pug, and so on.

classification model

A type of model that distinguishes among two or more discrete classes. For example, a natural language processing classification model could determine whether an input sentence was in French, Spanish, or Italian.

Compare with regression model.

classification threshold

A scalar-value criterion that is compared to a model's predicted score in order to separate the positive class from the negative class. Used when mapping logistic regression results to binary classification. For example, consider a logistic regression model that determines the probability of a given email message being spam. If the classification threshold is 0.9, then logistic regression values above 0.9 are classified as spam and those below 0.9 are classified as not spam.
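
A sketch of applying the 0.9 threshold from the example to a batch of predicted spam probabilities (the numbers are arbitrary):

import numpy as np

spam_probability = np.array([0.95, 0.40, 0.91, 0.62])
threshold = 0.9
is_spam = spam_probability > threshold
print(is_spam)     # [ True False  True False]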

class-imbalanced dataset

A binary classification problem in which the labels for the two classes have significantly different frequencies. For example, a disease dataset in which 0.0001 of examples have positive labels and 0.9999 have negative labels is a class-imbalanced problem, but a football game predictor in which 0.51 of examples label one team winning and 0.49 label the other team winning is not a class-imbalanced problem.

clipping

A technique for handling outliers by doing either or both of the following:

  • Reducing feature values that are greater than a maximum threshold down to that maximum threshold.
  • Increasing feature values that are less than a minimum threshold up to that minimum threshold.

For example, suppose that <0.5% of values for a particular feature fall outside the range 40–60. In this case, you could do the following:

  • Clip all values over 60 to be exactly 60.
  • Clip all values under 40 to be exactly 40.

You can also use clipping to force gradient values within a designated range during training.
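
For instance, the 40-60 clipping described above is a single call in NumPy (a sketch; the values are arbitrary):

import numpy as np

values = np.array([12.0, 45.0, 58.0, 97.0])
clipped = np.clip(values, 40.0, 60.0)   # values under 40 become 40; values over 60 become 60
print(clipped)                          # [40. 45. 58. 60.]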

Cloud TPU

#TensorFlow
#GoogleCloud

A specialized hardware accelerator designed to speed up machine learning workloads on Google Cloud Platform.

clustering

#clustering

Grouping related examples, particularly during unsupervised learning. Once all the examples are grouped, a human can optionally supply meaning to each cluster.

Many clustering algorithms exist. For example, the k-means algorithm clusters examples based on their proximity to a centroid, as in the following diagram:

A two-dimensional graph in which the x-axis is labeled "tree width" and the y-axis is labeled "tree height". The graph contains two centroids and several dozen data points. The data points are categorized based on their proximity. That is, the data points nearest to one centroid are categorized as "cluster 1", while those nearest to the other centroid are categorized as "cluster 2".

A human researcher could then review the clusters and, for example, label cluster 1 as "dwarf trees" and cluster 2 as "full-size trees".

As another example, consider a clustering algorithm based on an example's distance from a center point, illustrated as follows:

Dozens of data points are arranged in concentric circles, almost like holes around the center of a dart board. The innermost ring of data points is categorized as "cluster 1", the middle ring as "cluster 2", and the outermost ring as "cluster 3".

co-adaptation

When neurons predict patterns in training data by relying almost exclusively on outputs of specific other neurons instead of relying on the network's behavior as a whole. When the patterns that cause co-adaptation are not present in validation data, then co-adaptation causes overfitting. Dropout regularization reduces co-adaptation because dropout ensures neurons cannot rely solely on specific other neurons.

collaborative filtering

#recsystems

Making predictions about the interests of one user based on the interests of many other users. Collaborative filtering is often used in recommendation systems.

condition

#df

In a decision tree, any node that evaluates an expression. For example, the following portion of a decision tree contains two conditions:

A decision tree consisting of two conditions: (x > 0) and (y > 0).

A condition is also called a split or a test.

Contrast condition with leaf.

confirmation bias

#fairness

The tendency to search for, interpret, favor, and recall information in a way that confirms one's preexisting beliefs or hypotheses. Machine learning developers may inadvertently collect or label data in ways that influence an outcome supporting their existing beliefs. Confirmation bias is a form of implicit bias.

Experimenter's bias is a form of confirmation bias in which an experimenter continues training models until a preexisting hypothesis is confirmed.

confusion matrix

An NxN table that summarizes a classification model's correct and incorrect predictions. One axis of the confusion matrix is the label that the model predicted, and the other axis is the ground truth. N represents the number of classes; for example, N=2 for a binary classification model. Here is a sample confusion matrix for a binary classification model:

                            Tumor (predicted)   Non-Tumor (predicted)
Tumor (ground truth)               18                     1
Non-Tumor (ground truth)            6                   452

The preceding confusion matrix shows that of the 19 samples that actually had tumors, the model correctly classified 18 as having tumors (18 true positives) and incorrectly classified 1 as not having a tumor (1 false negative). Similarly, of 458 samples that actually did not have tumors, 452 were correctly classified (452 true negatives) and 6 were incorrectly classified (6 false positives).

The confusion matrix for a multi-class classification problem can help you identify patterns of mistakes. For example, a confusion matrix could reveal that a model trained to recognize handwritten digits tends to mistakenly predict 9 instead of 4, or 1 instead of 7. As another example, consider the following confusion matrix for a three-class multi-class classification model that categorizes three different iris types (Virginica, Versicolor, and Setosa). When the ground truth was Virginica, the confusion matrix shows that the model was far more likely to mistakenly predict Versicolor than Setosa:

                              Setosa (predicted)   Versicolor (predicted)   Virginica (predicted)
Setosa (ground truth)                 88                     12                       0
Versicolor (ground truth)              6                    141                       7
Virginica (ground truth)               2                     27                     109

Confusion matrices contain sufficient information to calculate a variety of performance metrics, including precision and recall.
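
As an illustration (not part of the glossary), precision, recall, and accuracy can be read straight off the binary confusion matrix above using its cell counts:

tp, fn = 18, 1     # Tumor (ground truth) row
fp, tn = 6, 452    # Non-Tumor (ground truth) row

precision = tp / (tp + fp)                     # 18/24   = 0.75
recall = tp / (tp + fn)                        # 18/19   ≈ 0.947
accuracy = (tp + tn) / (tp + fp + fn + tn)     # 470/477 ≈ 0.985
print(precision, recall, accuracy)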

continuous feature

A floating-point feature with an infinite range of possible values. Contrast with discrete feature.

convenience sampling

Using a dataset not gathered scientifically in order to run quick experiments. Later on, it's essential to switch to a scientifically gathered dataset.

convergence

Informally, often refers to a state reached during training in which training loss and validation loss change very little or not at all with each iteration after a certain number of iterations. In other words, a model reaches convergence when additional training on the current data will not improve the model. In deep learning, loss values sometimes stay constant or nearly so for many iterations before finally descending, temporarily producing a false sense of convergence.

See also early stopping.

See also Boyd and Vandenberghe, Convex Optimization.

convex function

A function in which the region above the graph of the function is a convex set. The prototypical convex function is shaped something like the letter U. For example, the following are all convex functions:

U-shaped curves, each with a single minimum point.

In contrast, the following function is not convex. Notice how the region above the graph is not a convex set:

A W-shaped curve with two different local minimum points.

A strictly convex function has exactly one local minimum point, which is also the global minimum point. The classic U-shaped functions are strictly convex functions. However, some convex functions (for example, straight lines) are not U-shaped.

Many common loss functions, including L2 loss and Log Loss, are convex functions.

Many variations of gradient descent are guaranteed to find a point close to the minimum of a strictly convex function. Similarly, many variations of stochastic gradient descent have a high probability (though not a guarantee) of finding a point close to the minimum of a strictly convex function.

The sum of two convex functions (for example, L2 loss + L1 regularization) is a convex function.

Deep models are never convex functions. Remarkably, algorithms designed for convex optimization tend to find reasonably good solutions on deep networks anyway, even though those solutions are not guaranteed to be a global minimum.

convex optimization

The process of using mathematical techniques such as gradient descent to find the minimum of a convex function. A great deal of research in machine learning has focused on formulating various problems as convex optimization problems and on solving those problems more efficiently.

For complete details, see Boyd and Vandenberghe, Convex Optimization.

convex set

A subset of Euclidean space such that a line drawn between any two points in the subset remains completely within the subset. For instance, the following two shapes are convex sets:

One illustration of a rectangle. Another illustration of an oval.

In contrast, the following two shapes are not convex sets:

One illustration of a pie chart with a missing slice. Another illustration of a wildly irregular polygon.

convolution

#image

In mathematics, casually speaking, a mixture of two functions. In machine learning, a convolution mixes the convolutional filter and the input matrix in order to train weights.

The term "convolution" in machine learning is often a shorthand way of referring to either convolutional operation or convolutional layer.

Without convolutions, a machine learning algorithm would have to learn a separate weight for every cell in a large tensor. For example, a machine learning algorithm training on 2K x 2K images would be forced to find 4 million separate weights. Thanks to convolutions, a machine learning algorithm only has to find weights for every cell in the convolutional filter, dramatically reducing the memory needed to train the model. When the convolutional filter is applied, it is simply replicated across cells such that each is multiplied by the filter.

convolutional filter

#image

One of the two actors in a convolutional operation. (The other actor is a slice of an input matrix.) A convolutional filter is a matrix having the same rank as the input matrix, but a smaller shape. For example, given a 28x28 input matrix, the filter could be any 2D matrix smaller than 28x28.

In photographic manipulation, all the cells in a convolutional filter are typically set to a constant pattern of ones and zeroes. In machine learning, convolutional filters are typically seeded with random numbers and then the network trains the ideal values.

convolutional layer

#image

A layer of a deep neural network in which a convolutional filter passes along an input matrix. For example, consider the following 3x3 convolutional filter:

A 3x3 matrix with the following values: [[0,1,0], [1,0,1], [0,1,0]]

The following animation shows a convolutional layer consisting of 9 convolutional operations involving the 5x5 input matrix. Notice that each convolutional operation works on a different 3x3 slice of the input matrix. The resulting 3x3 matrix (on the right) consists of the results of the 9 convolutional operations:

An animation showing two matrices. The first matrix is the 5x5 matrix: [[128,97,53,201,198], [35,22,25,200,195], [37,24,28,197,182], [33,28,92,195,179], [31,40,100,192,177]]. The second matrix is the 3x3 matrix: [[181,303,618], [115,338,605], [169,351,560]]. The second matrix is calculated by applying the convolutional filter [[0, 1, 0], [1, 0, 1], [0, 1, 0]] across different 3x3 subsets of the 5x5 matrix.

convolutional neural network

#image

A neural network in which at least one layer is a convolutional layer. A typical convolutional neural network consists of some combination of convolutional layers, pooling layers, and dense layers.

Convolutional neural networks have had great success in certain kinds of problems, such as image recognition.

convolutional operation

#image

The following two-step mathematical operation:

  1. Element-wise multiplication of the convolutional filter and a slice of an input matrix. (The slice of the input matrix has the same rank and size as the convolutional filter.)
  2. Summation of all the values in the resulting product matrix.

For example, consider the following 5x5 input matrix:

The 5x5 matrix: [[128,97,53,201,198], [35,22,25,200,195], [37,24,28,197,182], [33,28,92,195,179], [31,40,100,192,177]].

Now imagine the following 2x2 convolutional filter:

The 2x2 matrix: [[1, 0], [0, 1]]

Each convolutional operation involves a single 2x2 slice of the input matrix. For instance, suppose we use the 2x2 slice at the top-left of the input matrix. So, the convolution operation on this slice looks as follows:

Applying the convolutional filter [[1, 0], [0, 1]] to the top-left
          2x2 section of the input matrix, which is [[128,97], [35,22]].
          The convolutional filter leaves the 128 and 22 intact, but zeroes
          out the 97 and 35. Consequently, the convolution operation yields
          the value 150 (128+22).

A convolutional layer consists of a series of convolutional operations, each acting on a different slice of the input matrix.
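
A sketch of the two steps above in NumPy, using the same 2x2 filter and the top-left 2x2 slice of the input matrix:

import numpy as np

conv_filter = np.array([[1, 0],
                        [0, 1]])
input_slice = np.array([[128, 97],
                        [35, 22]])

# Step 1: element-wise multiplication; step 2: sum of the products.
print(np.sum(conv_filter * input_slice))   # 150 = 128*1 + 97*0 + 35*0 + 22*1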

cost

Synonym for loss .

co-training

A semi-supervised learning approach particularly useful when all of the following conditions are true:

Co-training essentially amplifies independent signals into a stronger signal. For instance, consider a classification model that categorizes individual used cars as either Good or Bad . One set of predictive features might focus on aggregate characteristics such as the year, make, and model of the car; another set of predictive features might focus on the previous owner's driving record and the car's maintenance history.

The seminal paper on co-training is Combining Labeled and Unlabeled Data with Co-Training by Blum and Mitchell.

counterfactual fairness

#fairness

A fairness metric that checks whether a classifier produces the same result for one individual as it does for another individual who is identical to the first, except with respect to one or more sensitive attributes . Evaluating a classifier for counterfactual fairness is one method for surfacing potential sources of bias in a model.

See "When Worlds Collide: Integrating Different Counterfactual Assumptions in Fairness" for a more detailed discussion of counterfactual fairness.

coverage bias

#fairness

See selection bias .

crash blossom

#language

A sentence or phrase with an ambiguous meaning. Crash blossoms present a significant problem in natural language understanding . For example, the headline Red Tape Holds Up Skyscraper is a crash blossom because an NLU model could interpret the headline literally or figuratively.

critic

#rl

Synonym for Deep Q-Network .

cross-entropy

A generalization of Log Loss to multi-class classification problems . Cross-entropy quantifies the difference between two probability distributions. See also perplexity .

cross-validation

A mechanism for estimating how well a model would generalize to new data by testing the model against one or more non-overlapping data subsets withheld from the training set .

D

data analysis

Obtaining an understanding of data by considering samples, measurement, and visualization. Data analysis can be particularly useful when a dataset is first received, before one builds the first model . It is also crucial in understanding experiments and debugging problems with the system.

data augmentation

#image

Artificially boosting the range and number of training examples by transforming existing examples to create additional examples. For example, suppose images are one of your features , but your dataset doesn't contain enough image examples for the model to learn useful associations. Ideally, you'd add enough labeled images to your dataset to enable your model to train properly. If that's not possible, data augmentation can rotate, stretch, and reflect each image to produce many variants of the original picture, possibly yielding enough labeled data to enable excellent training.

DataFrame

A popular datatype for representing datasets in pandas . A DataFrame is analogous to a table. Each column of the DataFrame has a name (a header), and each row is identified by a number.
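
For example, a tiny pandas DataFrame (the column names and values are made up for illustration):

import pandas as pd

df = pd.DataFrame({
    "postal_code": ["94043", "10011", "60606"],
    "rooms": [3, 1, 2],
    "price": [850000, 620000, 410000],
})
print(df)    # each column has a header; each row is identified by an integer index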

data parallelism

A way of scaling training or inference that replicates an entire model onto multiple devices and then passes a subset of the input data to each device. Data parallelism can enable training and inference on very large batch sizes ; however, data parallelism requires that the model be small enough to fit on all devices.

See also model parallelism .

data set or dataset

A collection of examples .

Dataset API (tf.data)

#TensorFlow

A high-level TensorFlow API for reading data and transforming it into a form that a machine learning algorithm requires. A tf.data.Dataset object represents a sequence of elements, in which each element contains one or more Tensors . A tf.data.Iterator object provides access to the elements of a Dataset .

For details about the Dataset API, see tf.data: Build TensorFlow input pipelines in the TensorFlow Programmer's Guide .
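
A minimal tf.data sketch, assuming TensorFlow 2.x; the input values are arbitrary:

import tensorflow as tf

# Build a Dataset from in-memory values, then shuffle and batch it.
dataset = tf.data.Dataset.from_tensor_slices([1, 2, 3, 4, 5, 6])
dataset = dataset.shuffle(buffer_size=6).batch(2)

for batch in dataset:     # each element is a Tensor holding one batch
    print(batch.numpy())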

decision boundary

The separator between classes learned by a model in binary or multi-class classification problems. For example, in the following image representing a binary classification problem, the decision boundary is the frontier between the orange class and the blue class:

A well-defined boundary between one class and another.

decision forest

#df

A model created from multiple decision trees. A decision forest makes a prediction by aggregating the predictions of its decision trees. Popular types of decision forests include random forests and gradient boosted trees.

decision threshold

Synonym for classification threshold .

decision tree

#df

A supervised learning model composed of a set of conditions and leaves organized hierarchically. For example, the following is a decision tree:

A decision tree consisting of four conditions arranged hierarchically, which lead to five leaves.

deep model

A type of neural network containing multiple hidden layers .

Contrast with wide model .

decoder

#language

In general, any ML system that converts from a processed, dense, or internal representation to a more raw, sparse, or external representation.

Decoders are often a component of a larger model, where they are frequently paired with an encoder .

In sequence-to-sequence tasks , a decoder starts with the internal state generated by the encoder to predict the next sequence.

Refer to Transformer for the definition of a decoder within the Transformer architecture.

deep neural network

Synonym for deep model .

Deep Q-Network (DQN)

#rl

In Q-learning , a deep neural network that predicts Q-functions .

Critic is a synonym for Deep Q-Network.

demographic parity

#fairness

A fairness metric that is satisfied if the results of a model's classification are not dependent on a given sensitive attribute .

For example, if both Lilliputians and Brobdingnagians apply to Glubbdubdrib University, demographic parity is achieved if the percentage of Lilliputians admitted is the same as the percentage of Brobdingnagians admitted, irrespective of whether one group is on average more qualified than the other.

Contrast with equalized odds and equality of opportunity , which permit classification results in aggregate to depend on sensitive attributes, but do not permit classification results for certain specified ground-truth labels to depend on sensitive attributes. See "Attacking discrimination with smarter machine learning" for a visualization exploring the tradeoffs when optimizing for demographic parity.

denoising

#language

A common approach to self-supervised learning in which:

  1. Noise is artificially added to the dataset.
  2. The model tries to remove the noise.

Denoising enables learning from unlabeled examples . The original dataset serves as the target or label and the noisy data as the input.

Some masked language models use denoising as follows:

  1. Noise is artificially added to an unlabeled sentence by masking some of the tokens.
  2. The model tries to predict the original tokens.

dense feature

A feature in which most values are non-zero, typically a Tensor of floating-point values.

Contrast with sparse feature .

dense layer

Synonym for fully connected layer .

depth

The number of layers (including any embedding layers) in a neural network that learn weights . For example, a neural network with 5 hidden layers and 1 output layer has a depth of 6.

depthwise separable convolutional neural network (sepCNN)

#image

A convolutional neural network architecture based on Inception , but where Inception modules are replaced with depthwise separable convolutions. Also known as Xception.

A depthwise separable convolution (also abbreviated as separable convolution) factors a standard 3-D convolution into two separate convolution operations that are more computationally efficient: first, a depthwise convolution, with a depth of 1 (n ✕ n ✕ 1), and then second, a pointwise convolution, with length and width of 1 (1 ✕ 1 ✕ n).

To learn more, see Xception: Deep Learning with Depthwise Separable Convolutions .

derived label

Synonym for proxy label .

device

#TensorFlow

A category of hardware that can run a TensorFlow session, including CPUs, GPUs, and TPUs .

dimension reduction

Decreasing the number of dimensions used to represent a particular feature in a feature vector, typically by converting to an embedding .

dimensions

Overloaded term having any of the following definitions:

  • The number of levels of coordinates in a Tensor. For example:

    • A scalar has zero dimensions; for example, ["Hello"] .
    • A vector has one dimension; for example, [3, 5, 7, 11] .
    • A matrix has two dimensions; for example, [[2, 4, 18], [5, 7, 14]] .

    You can uniquely specify a particular cell in a one-dimensional vector with one coordinate; you need two coordinates to uniquely specify a particular cell in a two-dimensional matrix.

  • The number of entries in a feature vector .

  • The number of elements in an embedding layer.

discrete feature

A feature with a finite set of possible values. For example, a feature whose values may only be animal , vegetable , or mineral is a discrete (or categorical) feature.

Contrast with continuous feature .

discriminative model

A model that predicts labels from a set of one or more features . More formally, discriminative models define the conditional probability of an output given the features and weights ; that is:

p(output | features, weights)

For example, a model that predicts whether an email is spam from features and weights is a discriminative model.

The vast majority of supervised learning models, including classification and regression models, are discriminative models.

Contrast with generative model .

discriminator

A system that determines whether examples are real or fake.

Alternatively, the subsystem within a generative adversarial network that determines whether the examples created by the generator are real or fake.

disparate impact

#fairness

Making decisions about people that impact different population subgroups disproportionately. This usually refers to situations where an algorithmic decision-making process harms or benefits some subgroups more than others.

For example, suppose an algorithm that determines a Lilliputian's eligibility for a miniature-home loan is more likely to classify them as “ineligible” if their mailing address contains a certain postal code. If Big-Endian Lilliputians are more likely to have mailing addresses with this postal code than Little-Endian Lilliputians, then this algorithm may result in disparate impact.

Contrast with disparate treatment , which focuses on disparities that result when subgroup characteristics are explicit inputs to an algorithmic decision-making process.

disparate treatment

#fairness

Factoring subjects' sensitive attributes into an algorithmic decision-making process such that different subgroups of people are treated differently.

For example, consider an algorithm that determines Lilliputians' eligibility for a miniature-home loan based on the data they provide in their loan application. If the algorithm uses a Lilliputian's affiliation as Big-Endian or Little-Endian as an input, it is enacting disparate treatment along that dimension.

Contrast with disparate impact , which focuses on disparities in the societal impacts of algorithmic decisions on subgroups, irrespective of whether those subgroups are inputs to the model.

divisive clustering

#clustering

See hierarchical clustering .

downsampling

#image

Overloaded term that can mean either of the following:

  • Reducing the amount of information in a feature in order to train a model more efficiently. For example, before training an image recognition model, downsampling high-resolution images to a lower-resolution format.
  • Training on a disproportionately low percentage of over-represented class examples in order to improve model training on under-represented classes. For example, in a class-imbalanced dataset , models tend to learn a lot about the majority class and not enough about the minority class . Downsampling helps balance the amount of training on the majority and minority classes.

DQN

#rl

Abbreviation for Deep Q-Network .

dropout regularization

A form of regularization useful in training neural networks . Dropout regularization removes a random selection of a fixed number of the units in a network layer for a single gradient step. The more units dropped out, the stronger the regularization. This is analogous to training the network to emulate an exponentially large ensemble of smaller networks. For full details, see Dropout: A Simple Way to Prevent Neural Networks from Overfitting .
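
A sketch of dropout in a tf.keras model; the layer sizes and the 0.3 rate are illustrative, not a recommendation:

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.3),   # randomly zeroes 30% of the units on each training step
    tf.keras.layers.Dense(1, activation="sigmoid"),
])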

dynamic model

A model that is trained online in a continuously updating fashion. That is, data is continuously entering the model.

E

eager execution

#TensorFlow

A TensorFlow programming environment in which operations run immediately. By contrast, operations called in graph execution don't run until they are explicitly evaluated. Eager execution is an imperative interface , much like the code in most programming languages. Eager execution programs are generally far easier to debug than graph execution programs.

early stopping

A method for regularization that involves ending model training before training loss finishes decreasing. In early stopping, you end model training when the loss on a validation dataset starts to increase, that is, when generalization performance worsens.

earth mover's distance (EMD)

A measure of the relative similarity between two documents. The lower the value, the more similar the documents.

embeddings

#language

A categorical feature represented as a continuous-valued feature. Typically, an embedding is a translation of a high-dimensional vector into a low-dimensional space. For example, you can represent the words in an English sentence in either of the following two ways:

  • As a million-element (high-dimensional) sparse vector in which all elements are integers. Each cell in the vector represents a separate English word; the value in a cell represents the number of times that word appears in a sentence. Since a single English sentence is unlikely to contain more than 50 words, nearly every cell in the vector will contain a 0. The few cells that aren't 0 will contain a low integer (usually 1) representing the number of times that word appeared in the sentence.
  • As a several-hundred-element (low-dimensional) dense vector in which each element holds a floating-point value between 0 and 1. This is an embedding.

In TensorFlow, embeddings are trained by backpropagating loss just like any other parameter in a neural network .

embedding space

#language

The d-dimensional vector space that features from a higher-dimensional vector space are mapped to. Ideally, the embedding space contains a structure that yields meaningful mathematical results; for example, in an ideal embedding space, addition and subtraction of embeddings can solve word analogy tasks.

The dot product of two embeddings is a measure of their similarity.

empirical risk minimization (ERM)

Choosing the function that minimizes loss on the training set. Contrast with structural risk minimization .

encoder

#language

In general, any ML system that converts from a raw, sparse, or external representation into a more processed, denser, or more internal representation.

Encoders are often a component of a larger model, where they are frequently paired with a decoder . Some Transformers pair encoders with decoders, though other Transformers use only the encoder or only the decoder.

Some systems use the encoder's output as the input to a classification or regression network.

In sequence-to-sequence tasks , an encoder takes an input sequence and returns an internal state (a vector). Then, the decoder uses that internal state to predict the next sequence.

Refer to Transformer for the definition of an encoder in the Transformer architecture.

ensemble

A collection of models trained independently whose predictions are averaged or aggregated. In many cases, an ensemble produces better predictions than a single model. For example, a random forest is an ensemble built from multiple decision trees . Note that not all decision forests are ensembles.

entropy

#df

In information theory, a description of how unpredictable a probability distribution is. Alternatively, entropy is also defined as how much information each example contains. A distribution has the highest possible entropy when all values of a random variable are equally likely.

The entropy of a set with two possible values "0" and "1" (for example, the labels in a binary classification problem) has the following formula:

H = -p log p - q log q = -p log p - (1-p) * log (1-p)

where:

  • H is the entropy.
  • p is the fraction of "1" examples.
  • q is the fraction of "0" examples. Note that q = (1 - p).
  • log is generally log2. In this case, the entropy unit is a bit.

For example, suppose the following:

  • 100 examples contain the value "1"
  • 300 examples contain the value "0"

Therefore, the entropy value is:

  • p = 0.25
  • q = 0.75
  • H = (-0.25)log2(0.25) - (0.75)log2(0.75) = 0.81 bits per example

A perfectly balanced set (for example, 200 "0"s and 200 "1"s) would have an entropy of 1.0 bit per example. As a set becomes more imbalanced, its entropy moves towards 0.0.

In decision trees, entropy helps formulate information gain to help the splitter select the conditions during the growth of a classification decision tree.

Compare entropy with Gini impurity.

Entropy is often called Shannon's entropy.
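
The worked example above (p = 0.25) can be reproduced in a few lines of Python (a sketch):

import math

def binary_entropy(p):
    # Entropy in bits of a two-valued distribution where p is the fraction of "1" examples.
    if p in (0.0, 1.0):
        return 0.0
    q = 1.0 - p
    return -p * math.log2(p) - q * math.log2(q)

print(binary_entropy(0.25))   # ≈ 0.81 bits per example
print(binary_entropy(0.5))    # 1.0 bit for a perfectly balanced set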

environment

#rl

In reinforcement learning, the world that contains the agent and allows the agent to observe that world's state . For example, the represented world can be a game like chess, or a physical world like a maze. When the agent applies an action to the environment, then the environment transitions between states.

episode

#rl

In reinforcement learning, each of the repeated attempts by the agent to learn an environment .

epoch

A full training pass over the entire dataset such that each example has been seen once. Thus, an epoch represents N / batch size training iterations , where N is the total number of examples.
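
As a quick arithmetic check of the N / batch size relationship (the numbers are arbitrary):

num_examples = 10000   # N
batch_size = 50
iterations_per_epoch = num_examples // batch_size
print(iterations_per_epoch)   # 200 gradient updates per epoch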

epsilon greedy policy

#rl

In reinforcement learning, a policy that either follows a random policy with epsilon probability or a greedy policy otherwise. For example, if epsilon is 0.9, then the policy follows a random policy 90% of the time and a greedy policy 10% of the time.

Over successive episodes, the algorithm reduces epsilon's value in order to shift from following a random policy to following a greedy policy. By shifting the policy, the agent first randomly explores the environment and then greedily exploits the results of random exploration.
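
A sketch of epsilon-greedy action selection over a row of Q-values, assuming discrete actions; here epsilon is the probability of acting randomly:

import numpy as np

def epsilon_greedy_action(q_values, epsilon, rng):
    # With probability epsilon pick a random action; otherwise pick the greedy (max-Q) action.
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

rng = np.random.default_rng(0)
q_values = np.array([0.1, 0.5, 0.2])
print(epsilon_greedy_action(q_values, epsilon=0.9, rng=rng))   # usually a random action early in training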

equality of opportunity

#fairness

A fairness metric that checks whether, for a preferred label (one that confers an advantage or benefit to a person) and a given attribute , a classifier predicts that preferred label equally well for all values of that attribute. In other words, equality of opportunity measures whether the people who should qualify for an opportunity are equally likely to do so regardless of their group membership.

For example, suppose Glubbdubdrib University admits both Lilliputians and Brobdingnagians to a rigorous mathematics program. Lilliputians' secondary schools offer a robust curriculum of math classes, and the vast majority of students are qualified for the university program. Brobdingnagians' secondary schools don't offer math classes at all, and as a result, far fewer of their students are qualified. Equality of opportunity is satisfied for the preferred label of "admitted" with respect to nationality (Lilliputian or Brobdingnagian) if qualified students are equally likely to be admitted irrespective of whether they're a Lilliputian or a Brobdingnagian.

For example, let's say 100 Lilliputians and 100 Brobdingnagians apply to Glubbdubdrib University, and admissions decisions are made as follows:

Table 1. Lilliputian applicants (90% are qualified)

Qualified Unqualified
Admitted 45 3
Rejected 45 7
Total 90 10
Percentage of qualified students admitted: 45/90 = 50%
Percentage of unqualified students rejected: 7/10 = 70%
Total percentage of Lilliputian students admitted: (45+3)/100 = 48%

Table 2. Brobdingnagian applicants (10% are qualified):

Qualified Unqualified
Admitted 5 9
Rejected 5 81
Total 10 90
Percentage of qualified students admitted: 5/10 = 50%
Percentage of unqualified students rejected: 81/90 = 90%
Total percentage of Brobdingnagian students admitted: (5+9)/100 = 14%

The preceding examples satisfy equality of opportunity for acceptance of qualified students because qualified Lilliputians and Brobdingnagians both have a 50% chance of being admitted.

See "Equality of Opportunity in Supervised Learning" for a more detailed discussion of equality of opportunity. Also see "Attacking discrimination with smarter machine learning" for a visualization exploring the tradeoffs when optimizing for equality of opportunity.

equalized odds

#fairness

A fairness metric that checks if, for any particular label and attribute, a classifier predicts that label equally well for all values of that attribute.

For example, suppose Glubbdubdrib University admits both Lilliputians and Brobdingnagians to a rigorous mathematics program. Lilliputians' secondary schools offer a robust curriculum of math classes, and the vast majority of students are qualified for the university program. Brobdingnagians' secondary schools don't offer math classes at all, and as a result, far fewer of their students are qualified. Equalized odds is satisfied provided that no matter whether an applicant is a Lilliputian or a Brobdingnagian, if they are qualified, they are equally as likely to get admitted to the program, and if they are not qualified, they are equally as likely to get rejected.

Let's say 100 Lilliputians and 100 Brobdingnagians apply to Glubbdubdrib University, and admissions decisions are made as follows:

Table 3. Lilliputian applicants (90% are qualified)

Qualified Unqualified
Admitted 45 2
Rejected 45 8
Total 90 10
Percentage of qualified students admitted: 45/90 = 50%
Percentage of unqualified students rejected: 8/10 = 80%
Total percentage of Lilliputian students admitted: (45+2)/100 = 47%

Table 4. Brobdingnagian applicants (10% are qualified):

Qualified Unqualified
Admitted 5 18
Rejected 5 72
Total 10 90
Percentage of qualified students admitted: 5/10 = 50%
Percentage of unqualified students rejected: 72/90 = 80%
Total percentage of Brobdingnagian students admitted: (5+18)/100 = 23%

Equalized odds is satisfied because qualified Lilliputian and Brobdingnagian students both have a 50% chance of being admitted, and unqualified Lilliputian and Brobdingnagian have an 80% chance of being rejected.

Equalized odds is formally defined in "Equality of Opportunity in Supervised Learning" as follows: "predictor Ŷ satisfies equalized odds with respect to protected attribute A and outcome Y if Ŷ and A are independent, conditional on Y."

Estimator

#TensorFlow

A deprecated TensorFlow API. Use tf.keras instead of Estimators.

example

One row of a dataset. An example contains one or more features and possibly a label . See also labeled example and unlabeled example .

experience replay

#rl

In reinforcement learning, a DQN technique used to reduce temporal correlations in training data. The agent stores state transitions in a replay buffer , and then samples transitions from the replay buffer to create training data.

experimenter's bias

#fairness

See confirmation bias .

exploding gradient problem

#seq

The tendency for gradients in deep neural networks (especially recurrent neural networks ) to become surprisingly steep (high). Steep gradients result in very large updates to the weights of each node in a deep neural network.

Models suffering from the exploding gradient problem become difficult or impossible to train. Gradient clipping can mitigate this problem.

Compare to vanishing gradient problem .

F

fairness constraint

#fairness
Applying a constraint to an algorithm to ensure one or more definitions of fairness are satisfied. Examples of fairness constraints include:

fairness metric

#fairness

A mathematical definition of “fairness” that is measurable. Some commonly used fairness metrics include:

Many fairness metrics are mutually exclusive; see incompatibility of fairness metrics .

false negative (FN)

An example in which the model mistakenly predicted the negative class . For example, the model inferred that a particular email message was not spam (the negative class), but that email message actually was spam.

false negative rate

The proportion of actual positive examples for which the negative class is predicted. False negative rate is calculated as follows:

$$\text{False Negative Rate} = \frac{\text{False Negatives}}{\text{False Negatives} + \text{True Positives}}$$

false positive (FP)

An example in which the model mistakenly predicted the positive class . For example, the model inferred that a particular email message was spam (the positive class), but that email message was actually not spam.

false positive rate (FPR)

The x-axis in an ROC curve . The false positive rate is defined as follows:

$$\text{False Positive Rate} = \frac{\text{False Positives}}{\text{False Positives} + \text{True Negatives}}$$
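
As a quick illustration of both rate formulas, the following sketch computes the false negative rate and the false positive rate from hypothetical confusion-matrix counts (the counts are made up):

  true_positives, false_positives = 40, 5    # made-up counts
  true_negatives, false_negatives = 50, 5

  false_negative_rate = false_negatives / (false_negatives + true_positives)
  false_positive_rate = false_positives / (false_positives + true_negatives)

  print(false_negative_rate)  # 5 / 45 ≈ 0.111
  print(false_positive_rate)  # 5 / 55 ≈ 0.091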

feature

An input variable used in making predictions .

feature cross

A synthetic feature formed by crossing (taking a Cartesian product of) individual binary features obtained from categorical data or from continuous features via bucketing . Feature crosses help represent nonlinear relationships.
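
For instance, a minimal sketch of crossing two bucketed features by taking their Cartesian product (the bucket names are hypothetical):

  import itertools

  latitude_buckets = ["lat_bin_1", "lat_bin_2", "lat_bin_3"]
  longitude_buckets = ["lng_bin_1", "lng_bin_2"]

  # Each crossed value becomes one synthetic feature in the vocabulary.
  crossed_vocabulary = [
      f"{lat}_x_{lng}"
      for lat, lng in itertools.product(latitude_buckets, longitude_buckets)
  ]
  print(len(crossed_vocabulary))  # 3 * 2 = 6 crossed features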

feature engineering

The process of determining which features might be useful in training a model, and then converting raw data from log files and other sources into said features. In TensorFlow, feature engineering often means converting raw log file entries to tf.Example protocol buffers. See also tf.Transform .

Feature engineering is sometimes called feature extraction .

feature extraction

Overloaded term having either of the following definitions:

feature importances

#df

Synonym for variable importances .

feature set

The group of features your machine learning model trains on. For example, postal code, property size, and property condition might comprise a simple feature set for a model that predicts housing prices.

feature spec

#TensorFlow

Describes the information required to extract features data from the tf.Example protocol buffer. Because the tf.Example protocol buffer is just a container for data, you must specify the following:

  • the data to extract (that is, the keys for the features)
  • the data type (for example, float or int)
  • the length (fixed or variable)
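
A rough sketch of such a feature spec in TensorFlow, assuming hypothetical feature names (price, bedrooms, keywords):

  import tensorflow as tf

  # Keys, dtypes, and lengths to extract from a serialized tf.Example.
  feature_spec = {
      "price": tf.io.FixedLenFeature([], tf.float32),    # fixed-length float
      "bedrooms": tf.io.FixedLenFeature([], tf.int64),   # fixed-length int
      "keywords": tf.io.VarLenFeature(tf.string),        # variable-length strings
  }

  def parse_example(serialized_example):
      # Returns a dict of Tensors keyed by the feature names above.
      return tf.io.parse_single_example(serialized_example, feature_spec)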

feature vector

The list of feature values representing an example passed into a model.

federated learning

A distributed machine learning approach that trains machine learning models using decentralized examples residing on devices such as smartphones. In federated learning, a subset of devices downloads the current model from a central coordinating server. The devices use the examples stored on the devices to make improvements to the model. The devices then upload the model improvements (but not the training examples) to the coordinating server, where they are aggregated with other updates to yield an improved global model. After the aggregation, the model updates computed by devices are no longer needed, and can be discarded.

Since the training examples are never uploaded, federated learning follows the privacy principles of focused data collection and data minimization.

For more information about federated learning, see this tutorial .

feedback loop

In machine learning, a situation in which a model's predictions influence the training data for the same model or another model. For example, a model that recommends movies will influence the movies that people see, which will then influence subsequent movie recommendation models.

feedforward neural network (FFN)

A neural network without cyclic or recursive connections. For example, traditional deep neural networks are feedforward neural networks. Contrast with recurrent neural networks , which are cyclic.

few-shot learning

A machine learning approach, often used for object classification, designed to learn effective classifiers from only a small number of training examples.

See also one-shot learning .

fine tuning

Perform a secondary optimization to adjust the parameters of an already trained model to fit a new problem. Fine tuning often refers to refitting the weights of a trained unsupervised model to a supervised model.

forget gate

#seq

The portion of a Long Short-Term Memory cell that regulates the flow of information through the cell. Forget gates maintain context by deciding which information to discard from the cell state.

full softmax

See softmax . Contrast with candidate sampling .

fully connected layer

A hidden layer in which each node is connected to every node in the subsequent hidden layer.

A fully connected layer is also known as a dense layer .

G

GAN

Abbreviation for generative adversarial network .

generalization

Refers to your model's ability to make correct predictions on new, previously unseen data as opposed to the data used to train the model.

generalization curve

A loss curve showing both the training set and the validation set . A generalization curve can help you detect possible overfitting . For example, the following generalization curve suggests overfitting because loss for the validation set ultimately becomes significantly higher than for the training set.

A Cartesian plot in which the y-axis is labeled 'loss' and the x-axis
          is labeled 'iterations'. Two graphs appear. One graph shows a loss
          curve for a training set and the other graph shows a loss curve for a
          validation set. The two curves start off similarly, but the curve for
          the training set eventually dips far lower than the curve for the
          validation set.

generalized linear model

A generalization of least squares regression models, which are based on Gaussian noise , to other types of models based on other types of noise, such as Poisson noise or categorical noise. Examples of generalized linear models include:

The parameters of a generalized linear model can be found through convex optimization .

Generalized linear models exhibit the following properties:

  • The average prediction of the optimal least squares regression model is equal to the average label on the training data.
  • The average probability predicted by the optimal logistic regression model is equal to the average label on the training data.

The power of a generalized linear model is limited by its features. Unlike a deep model, a generalized linear model cannot "learn new features."

generative adversarial network (GAN)

A system to create new data in which a generator creates data and a discriminator determines whether that created data is valid or invalid.

generative model

Practically speaking, a model that does either of the following:

  • Creates (generates) new examples from the training dataset. For example, a generative model could create poetry after training on a dataset of poems. The generator part of a generative adversarial network falls into this category.
  • Determines the probability that a new example comes from the training set, or was created from the same mechanism that created the training set. For example, after training on a dataset consisting of English sentences, a generative model could determine the probability that new input is a valid English sentence.

A generative model can theoretically discern the distribution of examples or particular features in a dataset. That is:

p(examples)

Unsupervised learning models are generative.

Contrast with discriminative models .

generator

The subsystem within a generative adversarial network that creates new examples .

Contrast with discriminative model .

GPT (Generative Pre-trained Transformer)

#language

A family of Transformer -based large language models developed by OpenAI .

GPT variants can apply to multiple modalities , including:

  • image generation (for example, ImageGPT)
  • text-to-image generation (for example, DALL-E ).

Gini impurity

#df

A metric similar to entropy . Splitters use values derived from either Gini impurity or entropy to compose conditions for classification decision trees . Information gain is derived from entropy. There is no universally accepted equivalent term for the metric derived from Gini impurity; however, this unnamed metric is just as important as information gain.

Gini impurity is also called the Gini index , or simply Gini .

gradient

The vector of partial derivatives with respect to all of the independent variables. In machine learning, the gradient is the vector of partial derivatives of the model function. The gradient points in the direction of steepest ascent.

gradient boosting

#df

A training algorithm in which weak models are trained to iteratively improve the quality of (reduce the loss of) a strong model. For example, a weak model could be a linear model or a small decision tree model. The strong model becomes the sum of all the previously trained weak models.

In the simplest form of gradient boosting, at each iteration a weak model is trained to predict the loss gradient of the strong model. Then, the strong model's output is updated by subtracting the predicted gradient, similar to gradient descent .

$$F_{0} = 0$$
$$F_{i+1} = F_i - \xi f_i$$

where:

  • $F_{0}$ is the starting strong model.
  • $F_{i+1}$ is the next strong model.
  • $F_{i}$ is the current strong model.
  • $\xi$ is a value between 0.0 and 1.0 called shrinkage , which is analogous to the learning rate in gradient descent.
  • $f_{i}$ is the weak model trained to predict the loss gradient of $F_{i}$.

Modern variations of gradient boosting also include the second derivative (Hessian) of the loss in their computation.

Decision trees are commonly used as the weak models in gradient boosting. See gradient boosted (decision) trees .
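
A toy sketch of the update above with squared loss. The labels are made up and the "weak model" is deliberately trivial (it predicts the mean residual) just to show the mechanics; real implementations use small decision trees as the weak models:

  import numpy as np

  y = np.array([3.0, 5.0, 7.0, 9.0])     # made-up labels
  strong_prediction = np.zeros_like(y)    # F_0 = 0
  shrinkage = 0.3                         # the xi in the formula above

  for i in range(20):
      # For squared loss, the loss gradient w.r.t. the prediction is
      # (prediction - label); the weak model is fit to that gradient.
      gradient = strong_prediction - y
      weak_prediction = np.full_like(y, gradient.mean())  # trivial weak model
      strong_prediction = strong_prediction - shrinkage * weak_prediction

  print(strong_prediction)  # approaches the mean of y (6.0) for this weak model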

gradient boosted (decision) trees (GBT)

#df

A type of decision forest in which:

gradient clipping

#seq

A commonly used mechanism to mitigate the exploding gradient problem by artificially limiting (clipping) the maximum value of gradients when using gradient descent to train a model.

gradient descent

A technique to minimize loss by computing the gradients of loss with respect to the model's parameters, conditioned on training data. Informally, gradient descent iteratively adjusts parameters, gradually finding the best combination of weights and bias to minimize loss.
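
A minimal sketch of gradient descent for a one-weight model y' = w * x with squared loss (all numbers made up for illustration):

  x = [1.0, 2.0, 3.0]
  y = [2.0, 4.0, 6.0]          # underlying relationship is y = 2x
  w = 0.0                      # initial weight
  learning_rate = 0.05

  for step in range(100):
      # Gradient of the mean squared loss with respect to w.
      grad = sum(2 * (w * xi - yi) * xi for xi, yi in zip(x, y)) / len(x)
      w = w - learning_rate * grad

  print(round(w, 3))           # converges close to 2.0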

graph

#TensorFlow

In TensorFlow, a computation specification. Nodes in the graph represent operations. Edges are directed and represent passing the result of an operation (a Tensor ) as an operand to another operation. Use TensorBoard to visualize a graph.

graph execution

#TensorFlow

A TensorFlow programming environment in which the program first constructs a graph and then executes all or part of that graph. Graph execution is the default execution mode in TensorFlow 1.x.

Contrast with eager execution .

greedy policy

#rl

In reinforcement learning, a policy that always chooses the action with the highest expected return .

ground truth

The correct answer. Reality. Since reality is often subjective, expert raters typically are the proxy for ground truth.

group attribution bias

#fairness

Assuming that what is true for an individual is also true for everyone in that group. The effects of group attribution bias can be exacerbated if convenience sampling is used for data collection. In a non-representative sample, attributions may be made that do not reflect reality.

See also out-group homogeneity bias and in-group bias .

H

hallucination

The production of plausible-seeming but factually incorrect output by a generative model that purports to be making an assertion about the real world. For example, if a dialog agent claims that Barack Obama died in 1865, the agent is hallucinating .

hashing

In machine learning, a mechanism for bucketing categorical data , particularly when the number of categories is large, but the number of categories actually appearing in the dataset is comparatively small.

For example, Earth is home to about 60,000 tree species. You could represent each of the 60,000 tree species in 60,000 separate categorical buckets. Alternatively, if only 200 of those tree species actually appear in a dataset, you could use hashing to divide tree species into perhaps 500 buckets.

A single bucket could contain multiple tree species. For example, hashing could place baobab and red maple —two genetically dissimilar species—into the same bucket. Regardless, hashing is still a good way to map large categorical sets into the desired number of buckets. Hashing turns a categorical feature having a large number of possible values into a much smaller number of values by grouping values in a deterministic way.
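
A minimal sketch of the idea, hashing species names (illustrative strings) into 500 buckets with a deterministic hash:

  import hashlib

  NUM_BUCKETS = 500

  def species_bucket(species_name: str) -> int:
      # Deterministically map a species name to one of NUM_BUCKETS buckets.
      digest = hashlib.md5(species_name.encode("utf-8")).hexdigest()
      return int(digest, 16) % NUM_BUCKETS

  print(species_bucket("baobab"))     # some bucket in [0, 500)
  print(species_bucket("red maple"))  # possibly the same bucket (a collision)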

heuristic

A simple and quickly implemented solution to a problem. For example, "With a heuristic, we achieved 86% accuracy. When we switched to a deep neural network, accuracy went up to 98%."

hidden layer

A synthetic layer in a neural network between the input layer (that is, the features) and the output layer (the prediction). Hidden layers typically contain an activation function (such as ReLU ) for training. A deep neural network contains more than one hidden layer.

hierarchical clustering

#clustering

A category of clustering algorithms that create a tree of clusters. Hierarchical clustering is well-suited to hierarchical data, such as botanical taxonomies. There are two types of hierarchical clustering algorithms:

  • Agglomerative clustering first assigns every example to its own cluster, and iteratively merges the closest clusters to create a hierarchical tree.
  • Divisive clustering first groups all examples into one cluster and then iteratively divides the cluster into a hierarchical tree.

Contrast with centroid-based clustering .

hinge loss

A family of loss functions for classification designed to find the decision boundary as distant as possible from each training example, thus maximizing the margin between examples and the boundary. KSVMs use hinge loss (or a related function, such as squared hinge loss). For binary classification, the hinge loss function is defined as follows:

$$\text{loss} = \text{max}(0, 1 - (y * y'))$$

where y is the true label, either -1 or +1, and y' is the raw output of the classifier model:

$$y' = b + w_1x_1 + w_2x_2 + \ldots + w_nx_n$$

Consequently, a plot of hinge loss vs. (y * y') looks as follows:

A Cartesian plot consisting of two joined line segments. The first
          line segment starts at (-3, 4) and ends at (1, 0). The second line
          segment begins at (1, 0) and continues indefinitely with a slope
          of 0.
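
A minimal NumPy sketch of the formula, with made-up raw classifier outputs:

  import numpy as np

  def hinge_loss(y_true, y_raw):
      # y_true is -1 or +1; y_raw is the classifier's raw output.
      return np.maximum(0.0, 1.0 - y_true * y_raw)

  y_true = np.array([+1, +1, -1, -1])
  y_raw = np.array([2.0, 0.3, -0.5, 1.5])
  print(hinge_loss(y_true, y_raw))   # [0.  0.7 0.5 2.5]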

holdout data

Examples intentionally not used ("held out") during training. The validation dataset and test dataset are examples of holdout data. Holdout data helps evaluate your model's ability to generalize to data other than the data it was trained on. The loss on the holdout set provides a better estimate of the loss on an unseen dataset than does the loss on the training set.

hyperparameter

The "knobs" that youtweak during successive runs of training a model. For example, learning rate is a hyperparameter.

Contrast with parameter .

hyperplane

A boundary that separates a space into two subspaces. For example, a line is a hyperplane in two dimensions and a plane is a hyperplane in three dimensions. More typically in machine learning, a hyperplane is the boundary separating a high-dimensional space. Kernel Support Vector Machines use hyperplanes to separate positive classes from negative classes, often in a very high-dimensional space.

I

iid

Abbreviation for independently and identically distributed .

image recognition

#image

A process that classifies object(s), pattern(s), or concept(s) in an image. Image recognition is also known as image classification .

For more information, see ML Practicum: Image Classification .

imbalanced dataset

Synonym for class-imbalanced dataset .

implicit bias

#fairness

Automatically making an association or assumption based on one's mental models and memories. Implicit bias can affect the following:

  • How data is collected and classified.
  • How machine learning systems are designed and developed.

For example, when building a classifier to identify wedding photos, an engineer may use the presence of a white dress in a photo as a feature. However, white dresses have been customary only during certain eras and in certain cultures.

See also confirmation bias .

incompatibility of fairness metrics

#fairness

The idea that some notions of fairness are mutually incompatible and cannot be satisfied simultaneously. As a result, there is no single universal metric for quantifying fairness that can be applied to all ML problems.

While this may seem discouraging, incompatibility of fairness metrics doesn't imply that fairness efforts are fruitless. Instead, it suggests that fairness must be defined contextually for a given ML problem, with the goal of preventing harms specific to its use cases.

See "On the (im)possibility of fairness" for a more detailed discussion of this topic.

independently and identically distributed (iid)

Data drawn from a distribution that doesn't change, and where each value drawn doesn't depend on values that have been drawn previously. An iid is the ideal gas of machine learning—a useful mathematical construct but almost never exactly found in the real world. For example, the distribution of visitors to a web page may be iid over a brief window of time; that is, the distribution doesn't change during that brief window and one person's visit is generally independent of another's visit. However, if you expand that window of time, seasonal differences in the web page's visitors may appear.

individual fairness

#fairness

A fairness metric that checks whether similar individuals are classified similarly. For example, Brobdingnagian Academy might want to satisfy individual fairness by ensuring that two students with identical grades and standardized test scores are equally likely to gain admission.

Note that individual fairness relies entirely on how you define "similarity" (in this case, grades and test scores), and you can run the risk of introducing new fairness problems if your similarity metric misses important information (such as the rigor of a student's curriculum).

See "Fairness Through Awareness" for a more detailed discussion of individual fairness.

inference

In machine learning, often refers to the process of making predictions by applying the trained model to unlabeled examples . In statistics, inference refers to the process of fitting the parameters of a distribution conditioned on some observed data. (See the Wikipedia article on statistical inference .)

inference path

#df

In a decision tree , during inference , the route a particular example takes from the root to other conditions , terminating with a leaf . For example, in the following decision tree, the thicker arrows show the inference path for an example with the following feature values:

  • x = 7
  • y = 12
  • z = -3

The inference path in the following illustration travels through three conditions before reaching the leaf ( Zeta ).

A decision tree consisting of four conditions and five leaves. The root condition is (x > 0). Since the answer is Yes, the inference path travels from the root to the next condition (y > 0). Since the answer is Yes, the inference path then travels to the next condition (z > 0). Since the answer is No, the inference path travels to its terminal node, which is the leaf (Zeta).

The three thick arrows show the inference path.

information gain

#df

In decision forests , the difference between a node's entropy and the weighted (by number of examples) sum of the entropy of its children nodes. A node's entropy is the entropy of the examples in that node.

For example, consider the following entropy values:

  • entropy of the parent node = 0.6
  • entropy of one child node with 16 relevant examples = 0.2
  • entropy of the other child node with 24 relevant examples = 0.1

So 40% of the examples are in one child node and 60% are in the other child node. Therefore:

  • weighted entropy sum of child nodes = (0.4 * 0.2) + (0.6 * 0.1) = 0.14

So, the information gain is:

  • information gain = entropy of the parent node - weighted entropy sum of child nodes
  • information gain = 0.6 - 0.14 = 0.46

Most splitters seek to create conditions that maximize information gain.
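
The worked example above, recomputed as a short Python sketch:

  parent_entropy = 0.6
  children = [
      {"entropy": 0.2, "num_examples": 16},
      {"entropy": 0.1, "num_examples": 24},
  ]

  total_examples = sum(c["num_examples"] for c in children)
  weighted_child_entropy = sum(
      c["entropy"] * c["num_examples"] / total_examples for c in children)

  information_gain = parent_entropy - weighted_child_entropy
  print(weighted_child_entropy)  # ≈ 0.14
  print(information_gain)        # ≈ 0.46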

in-group bias

#fairness

Showing partiality to one's own group or own characteristics. If testers or raters consist of the machine learning developer's friends, family, or colleagues, then in-group bias may invalidate product testing or the dataset.

In-group bias is a form of group attribution bias . See also out-group homogeneity bias .

input layer

The first layer (the one that receives the input data) in a neural network .

in-set condition

#df

In a decision tree , a condition that tests for the presence of one item in a set of items. For example, the following is an in-set condition:

  house-style in [tudor, colonial, cape]

During inference, if the value of the house-style feature is tudor , colonial , or cape , then this condition evaluates to Yes. If the value of the house-style feature is something else (for example, ranch ), then this condition evaluates to No.

In-set conditions usually lead to more efficient decision trees than conditions that test one-hot encoded features.

instance

Synonym for example .

interpretability

The ability to explain or to present an ML model's reasoning in understandable terms to a human.

inter-rater agreement

A measurement of how often human raters agree when doing a task. If raters disagree, the task instructions may need to be improved. Also sometimes called inter-annotator agreement or inter-rater reliability . See also Cohen's kappa , which is one of the most popular inter-rater agreement measurements.

intersection over union (IoU)

#image

The intersection of two sets divided by their union. In machine-learning image-detection tasks, IoU is used to measure the accuracy of the model's predicted bounding box with respect to the ground-truth bounding box. In this case, the IoU for the two boxes is the ratio between the overlapping area and the total area, and its value ranges from 0 (no overlap of predicted bounding box and ground-truth bounding box) to 1 (predicted bounding box and ground-truth bounding box have the exact same coordinates).

For example, in the image below:

  • The predicted bounding box (the coordinates delimiting where the model predicts the night table in the painting is located) is outlined in purple.
  • The ground-truth bounding box (the coordinates delimiting where the night table in the painting is actually located) is outlined in green.

The Van Gogh painting 'Vincent's Bedroom in Arles', with two different
          bounding boxes around the night table beside the bed. The ground-truth
          bounding box (in green) perfectly circumscribes the night table. The
          predicted bounding box (in purple) is offset 50% down and to the right
          of the ground-truth bounding box; it encloses the bottom-right quarter
          of the night table, but misses the rest of the table.

Here, the intersection of the bounding boxes for prediction and ground truth (below left) is 1, and the union of the bounding boxes for prediction and ground truth (below right) is 7, so the IoU is \(\frac{1}{7}\).

Same image as above, but with each bounding box divided into four quadrants. There are seven quadrants total, as the bottom-right quadrant of the ground-truth bounding box and the top-left quadrant of the predicted bounding box overlap each other. This overlapping section (highlighted in green) represents the intersection, and has an area of 1.

Same image as above, but with each bounding box divided into four quadrants. There are seven quadrants total, as the bottom-right quadrant of the ground-truth bounding box and the top-left quadrant of the predicted bounding box overlap each other. The entire interior enclosed by both bounding boxes (highlighted in green) represents the union, and has an area of 7.
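
A minimal sketch of the IoU calculation for two axis-aligned boxes given as (x_min, y_min, x_max, y_max); the boxes below reproduce the 1/7 example above:

  def iou(box_a, box_b):
      ax1, ay1, ax2, ay2 = box_a
      bx1, by1, bx2, by2 = box_b
      inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
      inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
      intersection = inter_w * inter_h
      union = ((ax2 - ax1) * (ay2 - ay1)
               + (bx2 - bx1) * (by2 - by1)
               - intersection)
      return intersection / union

  ground_truth = (0.0, 0.0, 2.0, 2.0)
  predicted = (1.0, 1.0, 3.0, 3.0)     # offset by half the box in each axis
  print(iou(ground_truth, predicted))  # 1/7 ≈ 0.1429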

IoU

Abbreviation for intersection over union .

item matrix

#recsystems

In recommendation systems , a matrix of embeddings generated by matrix factorization that holds latent signals about each item . Each row of the item matrix holds the value of a single latent feature for all items. For example, consider a movie recommendation system. Each column in the item matrix represents a single movie. The latent signals might represent genres, or might be harder-to-interpret signals that involve complex interactions among genre, stars, movie age, or other factors.

The item matrix has the same number of columns as the target matrix that is being factorized. For example, given a movie recommendation system that evaluates 10,000 movie titles, the item matrix will have 10,000 columns.

items

#recsystems

In a recommendation system , the entities that a system recommends. For example, videos are the items that a video store recommends, while books are the items that a bookstore recommends.

iteration

A single update of a model's weights during training. An iteration consists of computing the gradients of the parameters with respect to the loss on a single batch of data.

K

Keras

A popular Python machine learning API. Keras runs on several deep learning frameworks, including TensorFlow, where it is made available as tf.keras .

keypoints

#image

The coordinates of particular features in an image. For example, for an image recognition model that distinguishes flower species, keypoints might be the center of each petal, the stem, the stamen, and so on.

Kernel Support Vector Machines (KSVMs)

A classification algorithm that seeks to maximize the margin between positive and negative classes by mapping input data vectors to a higher dimensional space. For example, consider a classification problem in which the input dataset has a hundred features. To maximize the margin between positive and negative classes, a KSVM could internally map those features into a million-dimension space. KSVMs use a loss function called hinge loss .

k-means

#clustering

A popular clustering algorithm that groups examples in unsupervised learning. The k-means algorithm basically does the following:

  • Iteratively determines the best k center points (known as centroids ).
  • Assigns each example to the closest centroid. Those examples nearest the same centroid belong to the same group.

The k-means algorithm picks centroid locations to minimize the cumulative square of the distances from each example to its closest centroid.

For example, consider the following plot of dog height to dog width:

A Cartesian plot with several dozen data points.

If k=3, the k-means algorithm will determine three centroids. Each example is assigned to its closest centroid, yielding three groups:

The same Cartesian plot as in the previous illustration, except
          with three centroids added.
          The previous data points are clustered into three distinct groups,
          with each group representing the data points closest to a particular
          centroid.

Imagine that a manufacturer wants to determine the ideal sizes for small, medium, and large sweaters for dogs. The three centroids identify the mean height and mean width of each dog in that cluster. So, the manufacturer should probably base sweater sizes on those three centroids. Note that the centroid of a cluster is typically not an example in the cluster.

The preceding illustrations show k-means for examples with only two features (height and width). Note that k-means can group examples across many features.

k-median

#clustering

A clustering algorithm closely related to k-means . The practical difference between the two is as follows:

  • In k-means, centroids are determined by minimizing the sum of the squares of the distance between a centroid candidate and each of its examples.
  • In k-median, centroids are determined by minimizing the sum of the distance between a centroid candidate and each of its examples.

Note that the definitions of distance are also different:

  • k-means relies on the Euclidean distance from the centroid to an example. (In two dimensions, the Euclidean distance means using the Pythagorean theorem to calculate the hypotenuse.) For example, the k-means distance between (2,2) and (5,-2) would be:
$$\text{Euclidean distance} = \sqrt{(2-5)^2 + (2-(-2))^2} = 5$$
  • k-median relies on the Manhattan distance from the centroid to an example. This distance is the sum of the absolute deltas in each dimension. For example, the k-median distance between (2,2) and (5,-2) would be:
$$\text{Manhattan distance} = \lvert 2-5 \rvert + \lvert 2-(-2) \rvert = 7$$
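
The two worked distance calculations above, as a short Python sketch:

  def euclidean_distance(p, q):
      return sum((pi - qi) ** 2 for pi, qi in zip(p, q)) ** 0.5

  def manhattan_distance(p, q):
      return sum(abs(pi - qi) for pi, qi in zip(p, q))

  print(euclidean_distance((2, 2), (5, -2)))  # 5.0 (used by k-means)
  print(manhattan_distance((2, 2), (5, -2)))  # 7   (used by k-median)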

L

L 1 loss

Loss function based on the absolute value of the difference between the values that a model is predicting and the actual values of the labels . L 1 loss is less sensitive to outliers than L 2 loss .

L 1 regularization

A type of regularization that penalizes weights in proportion to the sum of the absolute values of the weights. In models relying on sparse features , L 1 regularization helps drive the weights of irrelevant or barely relevant features to exactly 0, which removes those features from the model. Contrast with L 2 regularization .

L 2 loss

See squared loss .

L 2 regularization

A type of regularization that penalizes weights in proportion to the sum of the squares of the weights. L 2 regularization helps drive outlier weights (those with high positive or low negative values) closer to 0 but not quite to 0. (Contrast with L1 regularization .) L 2 regularization always improves generalization in linear models.

label

In supervised learning, the "answer" or "result" portion of an example . Each example in a labeled dataset consists of one or more features and a label. For instance, in a housing dataset, the features might include the number of bedrooms, the number of bathrooms, and the age of the house, while the label might be the house's price. In a spam detection dataset, the features might include the subject line, the sender, and the email message itself, while the label would probably be either "spam" or "not spam."

labeled example

An example that contains features and a label . In supervised training, models learn from labeled examples.

LaMDA (Language Model for Dialogue Applications)

#language

A Transformer -based large language model developed by Google trained on a large dialogue dataset that can generate realistic conversational responses.

LaMDA: our breakthrough conversation technology provides an overview.

lambda

Synonym for regularization rate .

(This is an overloaded term. Here we're focusing on the term's definition within regularization .)

landmarks

#image

Synonym for keypoints .

language model

#language

A model that estimates the probability of a token or sequence of tokens occurring in a longer sequence of tokens.

large language model

#language

An informal term with no strict definition that usually means a language model that has a high number of parameters . Some large language models contain over 100 billion parameters.

layer

A set of neurons in a neural network that process a set of input features, or the output of those neurons.

Also, an abstraction in TensorFlow. Layers are Python functions that take Tensors and configuration options as input and produce other tensors as output.

Layers API (tf.layers)

#TensorFlow

A TensorFlow API for constructing a deep neural network as a composition of layers. The Layers API enables you to build different types of layers , such as:

The Layers API follows the Keras layers API conventions. That is, aside from a different prefix, all functions in the Layers API have the same names and signatures as their counterparts in the Keras layers API.

leaf

#df

Any endpoint in a decision tree . Unlike a condition , a leaf doesn't perform a test. Rather, a leaf is a possible prediction. A leaf is also the terminal node of an inference path .

For example, the following decision tree contains three leaves:

A decision tree with two conditions leading to three leaves.

learning rate

A scalar used to train a model via gradient descent. During each iteration, the gradient descent algorithm multiplies the learning rate by the gradient. The resulting product is called the gradient step .

Learning rate is a key hyperparameter .

least squares regression

A linear regression model trained by minimizing L 2 Loss .

linear model

A model that assigns one weight per feature to make predictions . (Linear models also incorporate a bias .) By contrast, the relationship of weights to features in deep models is not one-to-one.

A linear model uses the following formula:

$$y' = b + w_1x_1 + w_2x_2 + \ldots + w_nx_n$$

where:

  • \(y'\) is the raw prediction. (In certain kinds of linear models, this raw prediction will be further modified. For example, see logistic regression .)
  • \(b\) is the bias .
  • \(w\) is a weight , so \(w_1\) is the weight of the first feature, \(w_2\) is the weight of the second feature, and so on.
  • \(x\) is a feature , so \(x_1\) is the value of the first feature, \(x_2\) is the value of the second feature, and so on.

For example, suppose a linear model for three features learns the following bias and weights:

  • \(b\) = 7
  • \(w_1\) = -2.5
  • \(w_2\) = -1.2
  • \(w_3\) = 1.4

Therefore, given three features (\(x_1\), \(x_2\), and \(x_3\)), the linear model uses the following equation to generate each prediction:

$$y' = 7 + (-2.5)(x_1) + (-1.2)(x_2) + (1.4)(x_3)$$

Suppose a particular example contains the following values:

  • \(x_1\) = 4
  • \(x_2\) = -10
  • \(x_3\) = 5

Plugging those values into the formula yields a prediction for this example:

$$y' = 7 + (-2.5)(4) + (-1.2)(-10) + (1.4)(5)$$
$$y' = 16$$
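
The same worked example as a short Python sketch:

  bias = 7.0
  weights = [-2.5, -1.2, 1.4]
  features = [4.0, -10.0, 5.0]

  prediction = bias + sum(w * x for w, x in zip(weights, features))
  print(prediction)  # 16.0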

Linear models tend to be easier to analyze and train than deep models. However, deep models can model complex relationships between features.

Linear regression and logistic regression are two types of linear models. Linear models include not only models that use the linear equation but also a broader set of models that use the linear equation as part of the formula. For example, logistic regression post-processes the raw prediction (\(y'\)) to calculate the prediction.

linear regression

Using the raw output (\(y'\)) of a linear model as the actual prediction in a regression model . The goal of a regression problem is to make a real-valued prediction. For example, if the raw output (\(y'\)) of a linear model is 8.37, then the prediction is 8.37.

Contrast linear regression with logistic regression . Also, contrast regression with classification .

logistic regression

A classification model that uses a sigmoid function to convert a linear model's raw prediction (\(y'\)) into a value between 0 and 1. You can interpret the value between 0 and 1 in either of the following two ways:

  • As a probability that the example belongs to the positive class in a binary classification problem.
  • As a value to be compared against a classification threshold . If the value is equal to or above the classification threshold, the system classifies the example as the positive class. Conversely, if the value is below the given threshold, the system classifies the example as the negative class . For example, suppose the classification threshold is 0.82:
    • Imagine an example that produces a raw prediction (\(y'\)) of 2.6. The sigmoid of 2.6 is 0.93. Since 0.93 is greater than 0.82, the system classifies this example as the positive class.
    • Imagine a different example that produces a raw prediction of 1.3. The sigmoid of 1.3 is 0.79. Since 0.79 is less than 0.82, the system classifies that example as the negative class.

Although logistic regression is often used in binary classification problems, logistic regression can also be used in multi-class classification problems (where it becomes called multi-class logistic regression or multinomial regression ).
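
A minimal sketch that reproduces the thresholding example above (sigmoid conversion followed by a classification threshold of 0.82):

  import math

  def sigmoid(z):
      return 1.0 / (1.0 + math.exp(-z))

  threshold = 0.82
  for raw_prediction in (2.6, 1.3):
      probability = sigmoid(raw_prediction)
      label = "positive" if probability >= threshold else "negative"
      print(raw_prediction, round(probability, 2), label)
  # 2.6 -> 0.93 -> positive; 1.3 -> 0.79 -> negative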

logits

The vector of raw (non-normalized) predictions that a classification model generates, which is ordinarily then passed to a normalization function. If the model is solving a multi-class classification problem, logits typically become an input to the softmax function. The softmax function then generates a vector of (normalized) probabilities with one value for each possible class.

In addition, logits sometimes refer to the element-wise inverse of the sigmoid function . For more information, see tf.nn.sigmoid_cross_entropy_with_logits .

Log Loss

The loss function used in binary logistic regression .

log-odds

The logarithm of the odds of some event.

If the event refers to a binary probability, then odds refers to the ratio of the probability of success (p) to the probability of failure (1-p). For example, suppose that a given event has a 90% probability of success and a 10% probability of failure. In this case, odds is calculated as follows:

$$\text{odds} = \frac{p}{1-p} = \frac{0.9}{0.1} = 9$$

The log-odds is simply the logarithm of the odds. By convention, "logarithm" refers to natural logarithm, but logarithm could actually be any base greater than 1. Sticking to convention, the log-odds of our example is therefore:

$$\text{log-odds} = \ln(9) \approx 2.2$$

The log-odds are the inverse of the sigmoid function .
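
The worked example again as a short sketch, also confirming that applying the sigmoid to the log-odds recovers the original probability:

  import math

  p = 0.9
  odds = p / (1 - p)            # 9.0 (up to floating-point error)
  log_odds = math.log(odds)     # ≈ 2.197

  recovered_p = 1 / (1 + math.exp(-log_odds))
  print(odds, log_odds, recovered_p)  # ≈ 9.0, ≈ 2.2, ≈ 0.9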

Long Short-Term Memory (LSTM)

#seq

A type of cell in a recurrent neural network used to process sequences of data in applications such as handwriting recognition, machine translation, and image captioning. LSTMs address the vanishing gradient problem that occurs when training RNNs due to long data sequences by maintaining history in an internal memory state based on new input and context from previous cells in the RNN.

loss

A measure of how far a model's predictions are from its label . Or, to phrase it more pessimistically, a measure of how bad the model is. To determine this value, a model must define a loss function. For example, linear regression models typically use mean squared error for a loss function, while logistic regression models use Log Loss .

loss curve

A graph of loss as a function of training iterations . For example:

A Cartesian graph of loss versus training iterations, showing a
          steady drop as iterations increase, but then a slight rise in loss
          at a high number of iterations.

The loss curve can help you determine when your model is converging , overfitting , or underfitting .

loss surface

A graph of weight(s) vs. loss. Gradient descent aims to find the weight(s) for which the loss surface is at a local minimum.

LSTM

#seq

Abbreviation for Long Short-Term Memory .

M

machine learning

A program or system that builds (trains) a predictive model from input data. The system uses the learned model to make useful predictions from new (never-before-seen) data drawn from the same distribution as the one used to train the model. Machine learning also refers to the field of study concerned with these programs or systems.

majority class

The more common label in a class-imbalanced dataset . For example, given a dataset containing 99% non-spam labels and 1% spam labels, the non-spam labels are the majority class.

Markov decision process (MDP)

#rl

A graph representing the decision-making model where decisions (or actions ) are taken to navigate a sequence of states under the assumption that the Markov property holds. In reinforcement learning, these transitions between states return a numerical reward .

Markov property

#rl

A property of certain environments , where state transitions are entirely determined by information implicit in the current state and the agent's action .

masked language model

#language

A language model that predicts the probability of candidate tokens to fill in blanks in a sequence. For instance, a masked language model can calculate probabilities for candidate word(s) to replace the underline in the following sentence:

The ____ in the hat came back.

The literature typically uses the string "MASK" instead of an underline. For example:

The "MASK" in the hat came back.

Most modern masked language models are bidirectional .

matplotlib

An open-source Python 2D plotting library. matplotlib helps you visualize different aspects of machine learning.

matrix factorization

#recsystems

In math, a mechanism for finding the matrices whose dot product approximates a target matrix.

In recommendation systems , the target matrix often holds users' ratings on items . For example, the target matrix for a movie recommendation system might look something like the following, where the positive integers are user ratings and 0 means that the user didn't rate the movie:

         Casablanca   The Philadelphia Story   Black Panther   Wonder Woman   Pulp Fiction
User 1   5.0          3.0                      0.0             2.0            0.0
User 2   4.0          0.0                      0.0             1.0            5.0
User 3   3.0          1.0                      4.0             5.0            0.0

The movie recommendation system aims to predict user ratings for unrated movies. For example, will User 1 like Black Panther ?

One approach for recommendation systems is to use matrix factorization to generate the following two matrices:

  • A user matrix , shaped as the number of users X the number of embedding dimensions.
  • An item matrix , shaped as the number of embedding dimensions X the number of items.

For example, using matrix factorization on our three users and five items could yield the following user matrix and item matrix:

User Matrix                 Item Matrix

1.1   2.3           0.9   0.2   1.4    2.0   1.2
0.6   2.0           1.7   1.2   1.2   -0.1   2.1
2.5   0.5

The dot product of the user matrix and item matrix yields a recommendation matrix that contains not only the original user ratings but also predictions for the movies that each user hasn't seen. For example, consider User 1's rating of Casablanca , which was 5.0. The dot product corresponding to that cell in the recommendation matrix should hopefully be around 5.0, and it is:

(1.1 * 0.9) + (2.3 * 1.7) = 4.9

More importantly, will User 1 like Black Panther ? Taking the dot product corresponding to the first row and the third column yields a predicted rating of 4.3:

(1.1 * 1.4) + (2.3 * 1.2) = 4.3

Matrix factorization typically yields a user matrix and item matrix that, together, are significantly more compact than the target matrix.
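
A short NumPy sketch that recomputes the two dot products above from the example user and item matrices:

  import numpy as np

  user_matrix = np.array([[1.1, 2.3],
                          [0.6, 2.0],
                          [2.5, 0.5]])
  item_matrix = np.array([[0.9, 0.2, 1.4, 2.0, 1.2],
                          [1.7, 1.2, 1.2, -0.1, 2.1]])

  # Rows are users, columns are movies (Casablanca, ..., Pulp Fiction).
  recommendations = user_matrix @ item_matrix
  print(recommendations[0, 0])  # ≈ 4.9, close to User 1's actual 5.0 rating
  print(recommendations[0, 2])  # ≈ 4.3, predicted rating for Black Panther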

Mean Absolute Error (MAE)

An error metric calculated by taking an average of absolute errors. In the context of evaluating a model's accuracy, MAE is the average absolute difference between the expected and predicted values across all training examples. Specifically, for $n$ examples, for each value $y$ and its prediction $\hat{y}$, MAE is defined as follows:

\[\text{MAE} = \frac{1}{n}\sum_{i=1}^{n} | y_i - \hat{y}_i |\]
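
A minimal NumPy sketch of the formula with made-up labels and predictions:

  import numpy as np

  y_true = np.array([3.0, -0.5, 2.0, 7.0])
  y_pred = np.array([2.5, 0.0, 2.0, 8.0])

  mae = np.mean(np.abs(y_true - y_pred))
  print(mae)  # 0.5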

Mean Squared Error (MSE)

The average squared loss per example. MSE is calculated by dividing the squared loss by the number of examples . The values that TensorFlow Playground displays for "Training loss" and "Test loss" are MSE.

metric

#TensorFlow

A number that you care about. May or may not be directly optimized in a machine-learning system. A metric that your system tries to optimize is called an objective .

meta-learning

#language

A subset of machine learning that discovers or improves a learning algorithm. A meta-learning system can also aim to train a model to quickly learn a new task from a small amount of data or from experience gained in previous tasks. Meta-learning algorithms generally try to achieve the following:

  • Improve/learn hand-engineered features (such as an initializer or an optimizer).
  • Be more data-efficient and compute-efficient.
  • Improve generalization.

Meta-learning is related to few-shot learning .

Metrics API (tf.metrics)

A TensorFlow API for evaluating models. For example, tf.metrics.accuracy determines how often a model's predictions match labels.

mini-batch

A small, randomly selected subset of the entire batch of examples run together in a single iteration of training or inference. The batch size of a mini-batch is usually between 10 and 1,000. It is much more efficient to calculate the loss on a mini-batch than on the full training data.

mini-batch stochastic gradient descent

A gradient descent algorithm that uses mini-batches . In other words, mini-batch stochastic gradient descent estimates the gradient based on a small subset of the training data. Regular stochastic gradient descent uses a mini-batch of size 1.

minimax loss

A loss function for generative adversarial networks , based on the cross-entropy between the distribution of generated data and real data.

Minimax loss is used in the first paper to describe generative adversarial networks.

minority class

The less common label in a class-imbalanced dataset . For example, given a dataset containing 99% non-spam labels and 1% spam labels, the spam labels are the minority class.

ML

Abbreviation for machine learning .

MNIST

#image

A public-domain dataset compiled by LeCun, Cortes, and Burges containing 60,000 images, each image showing how a human manually wrote a particular digit from 0–9. Each image is stored as a 28x28 array of integers, where each integer is a grayscale value between 0 and 255, inclusive.

MNIST is a canonical dataset for machine learning, often used to test new machine learning approaches. For details, see The MNIST Database of Handwritten Digits .

modality

#language

A high-level data category. For example, numbers, text, images, video, and audio are five different modalities.

model

The representation of what a machine learning system has learned from the training data. Within TensorFlow, model is an overloaded term, which can have either of the following two related meanings:

  • The TensorFlow graph that expresses the structure of how a prediction will be computed.
  • The particular weights and biases of that TensorFlow graph, which are determined by training .

model capacity

The complexity of problems that a model can learn. The more complex the problems that a model can learn, the higher the model's capacity. A model's capacity typically increases with the number of model parameters. For a formal definition of classifier capacity, see VC dimension .

model parallelism

#language

A way of scaling training or inference that puts different parts of one model on different devices. Model parallelism enables models that are too big to fit on a single device.

See also data parallelism .

model training

The process of determining the best model .

Momentum

A sophisticated gradient descent algorithm in which a learning step depends not only on the derivative in the current step, but also on the derivatives of the step(s) that immediately preceded it. Momentum involves computing an exponentially weighted moving average of the gradients over time, analogous to momentum in physics. Momentum sometimes prevents learning from getting stuck in local minima.

multi-class classification

Classification problems that distinguish among more than two classes. For example, there are approximately 128 species of maple trees, so a model that categorized maple tree species would be multi-class. Conversely, a model that divided emails into only two categories ( spam and not spam ) would be a binary classification model .

multi-class logistic regression

Using logistic regression in multi-class classification problems.

multi-head self-attention

#language

An extension of self-attention that applies the self-attention mechanism multiple times for each position in the input sequence.

Transformers introduced multi-head self-attention.

multimodal model

#language

A model whose inputs and/or outputs include more than one modality . For example, consider a model that takes both an image and a text caption (two modalities) as features , and outputs a score indicating how appropriate the text caption is for the image. So, this model's inputs are multimodal and the output is unimodal.

multinomial classification

Synonym for multi-class classification .

multinomial regression

Synonym for multi-class logistic regression .

N

NaN trap

When one number in your model becomes a NaN during training, which causes many or all other numbers in your model to eventually become a NaN.

NaN is an abbreviation for "Not a Number."

natural language understanding

#language

Determining a user's intentions based on what the user typed or said. For example, a search engine uses natural language understanding to determine what the user is searching for based on what the user typed or said.

negative class

In binary classification , one class is termed positive and the other is termed negative. The positive class is the thing we're looking for and the negative class is the other possibility. For example, the negative class in a medical test might be "not tumor." The negative class in an email classifier might be "not spam." See also positive class .

neural network

A model that, taking inspiration from the brain, is composed of layers (at least one of which is hidden ) consisting of simple connected units or neurons followed by nonlinearities.

neuron

A node in a neural network , typically taking in multiple input values and generating one output value. The neuron calculates the output value by applying an activation function (nonlinear transformation) to a weighted sum of input values.

N-gram

#seq
#language

An ordered sequence of N words. For example, truly madly is a 2-gram. Because order is relevant, madly truly is a different 2-gram than truly madly .

N   Name(s) for this kind of N-gram   Examples
2   bigram or 2-gram                  to go, go to, eat lunch, eat dinner
3   trigram or 3-gram                 ate too much, three blind mice, the bell tolls
4   4-gram                            walk in the park, dust in the wind, the boy ate lentils

Many natural language understanding models rely on N-grams to predict the next word that the user will type or say. For example, suppose a user typed three blind . An NLU model based on trigrams would likely predict that the user will next type mice .

Contrast N-grams with bag of words , which are unordered sets of words.
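
A minimal sketch of extracting N-grams from a tokenized sentence:

  def ngrams(tokens, n):
      return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

  tokens = "the bell tolls for thee".split()
  print(ngrams(tokens, 2))
  # [('the', 'bell'), ('bell', 'tolls'), ('tolls', 'for'), ('for', 'thee')]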

NLU

#language

Abbreviation for natural language understanding .

node (neural network)

A neuron in a hidden layer .

node (TensorFlow graph)

#TensorFlow

An operation in a TensorFlow graph .

node (decision tree)

#df

In a decision tree , any condition or leaf .

A decision tree with two conditions and three leaves.

noise

Broadly speaking, anything that obscures the signal in a dataset. Noise can be introduced into data in a variety of ways. For example:

  • Human raters make mistakes in labeling.
  • Humans and instruments mis-record or omit feature values.

non-binary condition

#df

A condition containing more than two possible outcomes. For example, the following non-binary condition contains three possible outcomes:

A condition (number_of_legs = ?) leading to three possible outcomes. One outcome (number_of_legs = 8) leads to a leaf named spider. A second outcome (number_of_legs = 4) leads to a leaf named dog. A third outcome (number_of_legs = 2) leads to a leaf named penguin.

non-response bias

#fairness

See selection bias .

nonstationarity

A feature whose values change across one or more dimensions, usually time. For example, the number of swimsuits sold at a particular store demonstrates nonstationarity because that number varies with the season. As a second example, the quantity of a particular fruit harvested in a particular region typically shows sharp nonstationarity over time.

normalization

The process of converting an actual range of values into a standard range of values, typically -1 to +1 or 0 to 1. For example, suppose the natural range of a certain feature is 800 to 6,000. Through subtraction and division, you can normalize those values into the range -1 to +1.

See also scaling .

novelty detection

The process of determining whether a new (novel) example comes from the same distribution as the training set . In other words, after training on the training set, novelty detection determines whether a new example (during inference or during additional training) is an outlier .

Contrast with outlier detection .

numerical data

Features represented as integers or real-valued numbers. For example, in a real estate model, you would probably represent the size of a house (in square feet or square meters) as numerical data. Representing a feature as numerical data indicates that the feature's values have a mathematical relationship to each other and possibly to the label. For example, representing the size of a house as numerical data indicates that a 200 square-meter house is twice as large as a 100 square-meter house. Furthermore, the number of square meters in a house probably has some mathematical relationship to the price of the house.

Not all integer data should be represented as numerical data. For example, postal codes in some parts of the world are integers; however, integer postal codes should not be represented as numerical data in models. That's because a postal code of 20000 is not twice (or half) as potent as a postal code of 10000. Furthermore, although different postal codes do correlate to different real estate values, we can't assume that real estate values at postal code 20000 are twice as valuable as real estate values at postal code 10000. Postal codes should be represented as categorical data instead.

Numerical features are sometimes called continuous features .

NumPy

An open-source math library that provides efficient array operations in Python. pandas is built on NumPy.

O

objective

A metric that your algorithm is trying to optimize.

objective function

The mathematical formula or metric that a model aims to optimize. For example, the objective function for linear regression is usually squared loss . Therefore, when training a linear regression model, the goal is to minimize squared loss.

In some cases, the goal is to maximize the objective function. For example, if the objective function is accuracy, the goal is to maximize accuracy.

See also loss .

oblique condition

#df

In a decision tree , a condition involving more than one feature . For example, if height and width are both features, then the following is an oblique condition:

  height > width

Contrast with axis-aligned condition .

offline inference

Generating a group of predictions , storing those predictions, and then retrieving those predictions on demand. Contrast with online inference .

one-hot encoding

A sparse vector in which:

  • One element is set to 1.
  • All other elements are set to 0.

One-hot encoding is commonly used to represent strings or identifiers that have a finite set of possible values. For example, suppose a given botany dataset chronicles 15,000 different species, each denoted with a unique string identifier. As part of feature engineering, you'll probably encode those string identifiers as one-hot vectors in which the vector has a size of 15,000.
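
A minimal sketch with a tiny illustrative vocabulary (a real botany dataset might have 15,000 species):

  import numpy as np

  vocabulary = ["baobab", "red maple", "sequoia"]
  species = "red maple"

  one_hot = np.zeros(len(vocabulary))
  one_hot[vocabulary.index(species)] = 1.0
  print(one_hot)  # [0. 1. 0.]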

one-shot learning

A machine learning approach, often used for object classification, designed to learn effective classifiers from a single training example.

See also few-shot learning .

one-vs.-all

Given a classification problem with N possible solutions, a one-vs.-all solution consists of N separate binary classifiers —one binary classifier for each possible outcome. For example, given a model that classifies examples as animal, vegetable, or mineral, a one-vs.-all solution would provide the following three separate binary classifiers:

  • animal vs. not animal
  • vegetable vs. not vegetable
  • mineral vs. not mineral

online inference

Generating predictions on demand. Contrast with offline inference .

Operation (op)

#TensorFlow

A node in the TensorFlow graph. In TensorFlow, any procedure that creates, manipulates, or destroys a Tensor is an operation. For example, a matrix multiply is an operation that takes two Tensors as input and generates one Tensor as output.

out-of-bag evaluation (OOB evaluation)

#df

A mechanism for evaluating the quality of a decision forest by testing each decision tree against the examples not used while training that decision tree. For example, in the following diagram, notice that the system trains each decision tree on about two-thirds of the examples and then evaluates against the remaining one-third of the examples.

A decision forest consisting of three decision trees. One decision tree trains on two-thirds of the examples and then uses the remaining third for OOB evaluation. A second decision tree trains on a different two-thirds of the examples than the previous decision tree and then uses a different third for OOB evaluation than the previous decision tree.

OOB evaluation is a computationally efficient and conservative approximation of the cross-validation mechanism. In cross-validation, one model is trained for each cross-validation round (for example, 10 models are trained in 10-fold cross-validation). With OOB evaluation, a single model is trained. Because bagging withholds some data from each tree during training, OOB evaluation can use that data to approximate cross-validation.

optimizer

A specific implementation of the gradient descent algorithm. Popular optimizers include:

  • AdaGrad , which stands for ADAptive GRADient descent.
  • Adam, which stands for ADAptive with Momentum.

out-group homogeneity bias

#fairness

The tendency to see out-group members as more alike than in-group members when comparing attitudes, values, personality traits, and other characteristics. In-group refers to people you interact with regularly; out-group refers to people you do not interact with regularly. If you create a dataset by asking people to provide attributes about out-groups, those attributes may be less nuanced and more stereotyped than attributes that participants list for people in their in-group.

For example, Lilliputians might describe the houses of other Lilliputians in great detail, citing small differences in architectural styles, windows, doors, and sizes. However, the same Lilliputians might simply declare that Brobdingnagians all live in identical houses.

Out-group homogeneity bias is a form of group attribution bias .

See also in-group bias .

outlier detection

The process of identifying outliers in a training set .

Contrast with novelty detection .

outliers

Values distant from most other values. In machine learning, any of the following are outliers:

  • Weights with high absolute values.
  • Predicted values relatively far away from the actual values.
  • Input data whose values are more than roughly 3 standard deviations from the mean.

Outliers often cause problems in model training. Clipping is one way of managing outliers.

output layer

The "final" layer of a neural network. The layer containing the answer(s).

overfitting

Creating a model that matches the training data so closely that the model fails to make correct predictions on new data.

oversampling

Reusing the examples of a minority class in a class-imbalanced dataset in order to create a more balanced training set .

For example, consider a binary classification problem in which the ratio of the majority class to the minority class is 5,000:1. If the dataset contains a million examples, then the dataset contains only about 200 examples of the minority class, which might be too few examples for effective training. To overcome this deficiency, you might oversample (reuse) those 200 examples multiple times, possibly yielding sufficient examples for useful training.
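
A rough sketch of oversampling by reusing minority-class examples; the dataset sizes and example names are hypothetical:

import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical class-imbalanced training set: 1,000 negatives, 10 positives.
negatives = [("example_%d" % i, 0) for i in range(1000)]
positives = [("rare_%d" % i, 1) for i in range(10)]

# Oversample: reuse minority-class examples (picking with replacement)
# until the two classes are roughly balanced.
reused = [positives[i] for i in rng.integers(0, len(positives), size=len(negatives))]
balanced = negatives + reused

print(sum(label for _, label in balanced), "positives out of", len(balanced), "examples")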

You need to be careful about overfitting when oversampling.

Contrast with undersampling .

P

pandas

A column-oriented data analysis API. Many machine learning frameworks, including TensorFlow, support pandas data structures as input. See the pandas documentation for details.

parameter

A variable of a model that the machine learning system trains on its own. For example, weights are parameters whose values the machine learning system gradually learns through successive training iterations. Contrast with hyperparameter .

Parameter Server (PS)

#TensorFlow

A job that keeps track of a model's parameters in a distributed setting.

parameter update

The operation of adjusting a model's parameters during training, typically within a single iteration of gradient descent .

partial derivative

A derivative in which all but one of the variables is considered a constant. For example, the partial derivative of f(x, y) with respect to x is the derivative of f considered as a function of x alone (that is, keeping y constant). The partial derivative of f with respect to x focuses only on how x is changing and ignores all other variables in the equation.
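
A small numerical sketch that approximates the partial derivative with respect to x by nudging only x while holding y constant; the function f is a hypothetical example:

def f(x, y):
    return x**2 * y + y**3     # hypothetical example function

def partial_wrt_x(x, y, h=1e-6):
    # Hold y constant and nudge only x (central finite-difference approximation).
    return (f(x + h, y) - f(x - h, y)) / (2 * h)

print(partial_wrt_x(2.0, 3.0))   # analytically 2xy = 12; prints approximately 12.0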

participation bias

#fairness

Synonym for non-response bias. See selection bias .

partitioning strategy

The algorithm by which variables are divided across parameter servers .

perceptron

A system (either hardware or software) that takes in one or more input values, runs a function on the weighted sum of the inputs, and computes a single output value. In machine learning, the function is typically nonlinear, such as ReLU , sigmoid , or tanh. For example, the following perceptron relies on the sigmoid function to process three input values:

$$f(x_1, x_2, x_3) = \text{sigmoid}(w_1 x_1 + w_2 x_2 + w_3 x_3)$$

In the following illustration, the perceptron takes three inputs, each of which is itself modified by a weight before entering the perceptron:

A perceptron that takes in 3 inputs, each multiplied by separate
          weights. The perceptron outputs a single value.

Perceptrons are the nodes in deep neural networks . That is, a deep neural network consists of multiple connected perceptrons, plus a backpropagation algorithm to introduce feedback.
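
A minimal sketch of the three-input sigmoid perceptron in the formula above; the weight values are illustrative:

import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def perceptron(x1, x2, x3, w1=0.4, w2=-0.2, w3=0.9):
    """Runs a sigmoid on the weighted sum of the inputs, as in the formula above."""
    return sigmoid(w1 * x1 + w2 * x2 + w3 * x3)

print(perceptron(1.0, 2.0, 0.5))   # a single output value between 0 and 1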

performance

Overloaded term with the following meanings:

  • The traditional meaning within software engineering. Namely: How fast (or efficiently) does this piece of software run?
  • The meaning within machine learning. Here, performance answers the following question: How correct is this model ? That is, how good are the model's predictions?

permutation variable importance

#df

A type of variable importance that evaluates the increase in a model's prediction error after permuting the feature's values. Permutation variable importance is a model-agnostic metric.

perplexity

One measure of how well a model is accomplishing its task. For example, suppose your task is to read the first few letters of a word a user is typing on a smartphone keyboard, and to offer a list of possible completion words. Perplexity, P, for this task is approximately the number of guesses you need to offer in order for your list to contain the actual word the user is trying to type.

Perplexity is related to cross-entropy as follows:

$$P= 2^{-\text{cross entropy}}$$

pipeline

The infrastructure surrounding a machine learning algorithm. A pipeline includes gathering the data, putting the data into training data files, training one or more models, and exporting the models to production.

pipelining

#language

A form of model parallelism in which a model's processing is divided into consecutive stages and each stage is executed on a different device. While a stage is processing one batch, the preceding stage can work on the next batch.

See also staged training .

policy

#rl

In reinforcement learning, an agent's probabilistic mapping from states to actions .

pooling

#image

Reducing a matrix (or matrices) created by an earlier convolutional layer to a smaller matrix. Pooling usually involves taking either the maximum or average value across the pooled area. For example, suppose we have the following 3x3 matrix:

The 3x3 matrix [[5,3,1], [8,2,5], [9,4,3]].

A pooling operation, just like a convolutional operation, divides that matrix into slices and then slides that pooling operation by strides . For example, suppose the pooling operation divides the convolutional matrix into 2x2 slices with a 1x1 stride. As the following diagram illustrates, four pooling operations take place. Imagine that each pooling operation picks the maximum value of the four in that slice:

The input matrix is 3x3 with the values: [[5,3,1], [8,2,5], [9,4,3]].
          The top-left 2x2 submatrix of the input matrix is [[5,3], [8,2]], so
          the top-left pooling operation yields the value 8 (which is the
          maximum of 5, 3, 8, and 2). The top-right 2x2 submatrix of the input
          matrix is [[3,1], [2,5]], so the top-right pooling operation yields
          the value 5. The bottom-left 2x2 submatrix of the input matrix is
          [[8,2], [9,4]], so the bottom-left pooling operation yields the value
          9.  The bottom-right 2x2 submatrix of the input matrix is
          [[2,5], [4,3]], so the bottom-right pooling operation yields the value
          5.  In summary, the pooling operation yields the 2x2 matrix
          [[8,5], [9,5]].
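
The max-pooling result described above (2x2 slices, 1x1 stride) can be reproduced with a short NumPy sketch:

import numpy as np

matrix = np.array([[5, 3, 1],
                   [8, 2, 5],
                   [9, 4, 3]])

def max_pool(m, size=2, stride=1):
    rows = (m.shape[0] - size) // stride + 1
    cols = (m.shape[1] - size) // stride + 1
    out = np.empty((rows, cols), dtype=m.dtype)
    for i in range(rows):
        for j in range(cols):
            # Pick the maximum value in each 2x2 slice.
            out[i, j] = m[i * stride:i * stride + size,
                          j * stride:j * stride + size].max()
    return out

print(max_pool(matrix))   # -> [[8 5]
                          #     [9 5]]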

Pooling helps enforce translational invariance in the input matrix.

Pooling for vision applications is known more formally as spatial pooling . Time-series applications usually refer to pooling as temporal pooling . Less formally, pooling is often called subsampling or downsampling .

positive class

In binary classification , the two possible classes are labeled as positive and negative. The positive outcome is the thing we're testing for. (Admittedly, we're simultaneously testing for both outcomes, but play along.) For example, the positive class in a medical test might be "tumor." The positive class in an email classifier might be "spam."

Contrast with negative class .

post-processing

#fairness
Processing the output of a model after the model has been run. Post-processing can be used to enforce fairness constraints without modifying models themselves.

For example, one might apply post-processing to a binary classifier by setting a classification threshold such that equality of opportunity is maintained for some attribute by checking that the true positive rate is the same for all values of that attribute.

PR AUC (area under the PR curve)

Area under the interpolated precision-recall curve , obtained by plotting (recall, precision) points for different values of the classification threshold . Depending on how it's calculated, PR AUC may be equivalent to the average precision of the model.

precision

A metric for classification models . Precision identifies the frequency with which a model was correct when predicting the positive class . That is:

$$\text{Precision} = \frac{\text{True Positives}} {\text{True Positives} + \text{False Positives}}$$
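
A tiny sketch computing precision from hypothetical counts of true and false positives:

def precision(true_positives, false_positives):
    return true_positives / (true_positives + false_positives)

# Hypothetical counts: the model made 30 positive predictions, 24 of them correct.
print(precision(true_positives=24, false_positives=6))   # -> 0.8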

precision-recall curve

A curve of precision vs. recall at different classification thresholds .

prediction

A model's output when provided with an input example .

prediction bias

A value indicating how far apart the average of predictions is from the average of labels in the dataset.

Not to be confused with the bias term in machine learning models or with bias in ethics and fairness .

predictive parity

#fairness

A fairness metric that checks whether, for a given classifier, the precision rates are equivalent for subgroups under consideration.

For example, a model that predicts college acceptance would satisfy predictive parity for nationality if its precision rate is the same for Lilliputians and Brobdingnagians.

Predictive parity is sometimes also called predictive rate parity .

See "Fairness Definitions Explained" (section 3.2.1) for a more detailed discussion of predictive parity.

predictive rate parity

#fairness

Another name for predictive parity .

preprocessing

#fairness
Processing data before it's used to train a model. Preprocessing could be as simple as removing words from an English text corpus that don't occur in the English dictionary, or could be as complex as re-expressing data points in a way that eliminates as many attributes that are correlated with sensitive attributes as possible. Preprocessing can help satisfy fairness constraints .

pre-trained model

Models or model components (such as embeddings ) that have already been trained. Sometimes, you'll feed pre-trained embeddings into a neural network . Other times, your model will train the embeddings itself rather than rely on the pre-trained embeddings.

prior belief

What you believe about the data before you begin training on it. For example, L 2 regularization relies on a prior belief that weights should be small and normally distributed around zero.

probabilistic regression model

A regression model that uses not only the weights for each feature , but also the uncertainty of those weights. A probabilistic regression model generates a prediction and the uncertainty of that prediction. For example, a probabilistic regression model might yield a prediction of 325 with a standard deviation of 12. For more information about probabilistic regression models, see this Colab on tensorflow.org .

proxy (sensitive attributes)

#fairness
An attribute used as a stand-in for a sensitive attribute . For example, an individual's postal code might be used as a proxy for their income, race, or ethnicity.

proxy labels

Data used to approximate labels not directly available in a dataset.

For example, suppose you want is it raining? to be a Boolean label for your dataset, but the dataset doesn't contain rain data. If photographs are available, you might establish pictures of people carrying umbrellas as a proxy label for is it raining? However, proxy labels may distort results. For example, in some places, it may be more common to carry umbrellas to protect against sun than the rain.

Q

Q-function

#rl

In reinforcement learning, the function that predicts the expected return from taking an action in a state and then following a given policy .

Q-function is also known as state-action value function .

Q-learning

#rl

In reinforcement learning, an algorithm that allows an agent to learn the optimal Q-function of a Markov decision process by applying the Bellman equation . The Markov decision process models an environment .

quantile

Each bucket in quantile bucketing .

quantile bucketing

Distributing a feature's values into buckets so that each bucket contains the same (or almost the same) number of examples. For example, the following figure divides 44 points into 4 buckets, each of which contains 11 points. In order for each bucket in the figure to contain the same number of points, some buckets span a different width of x-values.

44 data points divided into 4 buckets of 11 points each.
          Although each bucket contains the same number of data points,
          some buckets contain a wider range of feature values than other
          buckets.
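
A minimal NumPy sketch of quantile bucketing on a synthetic, skewed feature; the distribution and bucket count are illustrative:

import numpy as np

rng = np.random.default_rng(seed=0)
values = rng.lognormal(mean=0.0, sigma=1.0, size=44)   # skewed synthetic feature

# Boundaries at the 25th, 50th, and 75th percentiles split the values into
# 4 buckets that each contain (almost) the same number of examples.
boundaries = np.quantile(values, [0.25, 0.5, 0.75])
buckets = np.digitize(values, boundaries)

print(np.bincount(buckets))   # -> roughly [11 11 11 11]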

quantization

An algorithm that implements quantile bucketing on a particular feature in a dataset .

queue

#TensorFlow

A TensorFlow Operation that implements a queue data structure. Typically used in I/O.

R

random forest

#df

An ensemble of decision trees in which each decision tree is trained with specific random noise, such as bagging .

Random forests are a type of decision forest .

random policy

#rl

In reinforcement learning, a policy that chooses an action at random.

ranking

A type of supervised learning whose objective is to order a list of items.

rank (ordinality)

The ordinal position of a class in a machine learning problem that categorizes classes from highest to lowest. For example, a behavior ranking system could rank a dog's rewards from highest (a steak) to lowest (wilted kale).

rank (Tensor)

#TensorFlow

The number of dimensions in a Tensor . For instance, a scalar has rank 0, a vector has rank 1, and a matrix has rank 2.

Not to be confused with rank (ordinality) .

rater

A human who provides labels in examples . Sometimes called an "annotator."

recall

A metric for classification models that answers the following question: Out of all the possible positive labels, how many did the model correctly identify? That is:

$$\text{Recall} = \frac{\text{True Positives}} {\text{True Positives} + \text{False Negatives}}$$

recommendation system

#recsystems

A system that selects for each user a relatively small set of desirable items from a large corpus. For example, a video recommendation system might recommend two videos from a corpus of 100,000 videos, selecting Casablanca and The Philadelphia Story for one user, and Wonder Woman and Black Panther for another. A video recommendation system might base its recommendations on factors such as:

  • Movies that similar users have rated or watched.
  • Genre, directors, actors, target demographic...

Rectified Linear Unit (ReLU)

An activation function with the following rules:

  • If input is negative or zero, output is 0.
  • If input is positive, output is equal to input.
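
Those rules, expressed as a small Python function:

def relu(x):
    """Returns 0 for negative or zero input; returns the input itself otherwise."""
    return max(0.0, x)

print([relu(v) for v in (-3.0, 0.0, 2.5)])   # -> [0.0, 0.0, 2.5]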

recurrent neural network

#seq

A neural network that is intentionally run multiple times, where parts of each run feed into the next run. Specifically, hidden layers from the previous run provide part of the input to the same hidden layer in the next run. Recurrent neural networks are particularly useful for evaluating sequences, so that the hidden layers can learn from previous runs of the neural network on earlier parts of the sequence.

For example, the following figure shows a recurrent neural network that runs four times. Notice that the values learned in the hidden layers from the first run become part of the input to the same hidden layers in the second run. Similarly, the values learned in the hidden layer on the second run become part of the input to the same hidden layer in the third run. In this way, the recurrent neural network gradually trains and predicts the meaning of the entire sequence rather than just the meaning of individual words.

An RNN that runs four times to process four input words.

regression model

A type of model that outputs continuous (typically, floating-point) values. Compare with classification models , which output discrete values, such as "day lily" or "tiger lily."

regularization

The penalty on a model's complexity. Regularization helps prevent overfitting . Different kinds of regularization include:

regularization rate

A scalar value, represented as lambda, specifying the relative importance of the regularization function. The following simplified loss equation shows the regularization rate's influence:

$$\text{minimize(loss function + }\lambda\text{(regularization function))}$$

Raising the regularization rate reduces overfitting but may make the model less accurate .

reinforcement learning (RL)

#rl

A family of algorithms that learn an optimal policy , whose goal is to maximize return when interacting with an environment . For example, the ultimate reward of most games is victory. Reinforcement learning systems can become expert at playing complex games by evaluating sequences of previous game moves that ultimately led to wins and sequences that ultimately led to losses.

replay buffer

#rl

In DQN -like algorithms, the memory used by the agent to store state transitions for use in experience replay .

reporting bias

#fairness

The fact that the frequency with which people write about actions, outcomes, or properties is not a reflection of their real-world frequencies or the degree to which a property is characteristic of a class of individuals. Reporting bias can influence the composition of data that machine learning systems learn from.

For example, in books, the word laughed is more prevalent than breathed . A machine learning model that estimates the relative frequency of laughing and breathing from a book corpus would probably determine that laughing is more common than breathing.

representation

The process of mapping data to useful features .

re-ranking

#recsystems

The final stage of a recommendation system , during which scored items may be re-graded according to some other (typically, non-ML) algorithm. Re-ranking evaluates the list of items generated by the scoring phase, taking actions such as:

  • Eliminating items that the user has already purchased.
  • Boosting the score of fresher items.

return

#rl

In reinforcement learning, given a certain policy and a certain state, the return is the sum of all rewards that the agent expects to receive when following the policy from the state to the end of the episode . The agent accounts for the delayed nature of expected rewards by discounting rewards according to the state transitions required to obtain the reward.

Therefore, if the discount factor is \(\gamma\), and \(r_0, \ldots, r_{N}\)denote the rewards until the end of the episode, then the return calculation is as follows:

$$\text{Return} = r_0 + \gamma r_1 + \gamma^2 r_2 + \ldots + \gamma^{N-1} r_{N-1}$$
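
A minimal sketch of that return calculation, using hypothetical rewards and a discount factor of 0.9:

def discounted_return(rewards, gamma=0.9):
    """Sums rewards discounted by gamma per state transition, as in the formula above."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Hypothetical episode: rewards r_0..r_3 with a discount factor of 0.9.
print(discounted_return([1.0, 0.0, 0.0, 10.0], gamma=0.9))   # -> 1 + 0.9**3 * 10 = 8.29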

reward

#rl

In reinforcement learning, the numerical result of taking an action in a state , as defined by the environment .

ridge regularization

Synonym for L 2 regularization . The term ridge regularization is more frequently used in pure statistics contexts, whereas L 2 regularization is used more often in machine learning.

RNN

#seq

Abbreviation for recurrent neural networks .

ROC (receiver operating characteristic) Curve

A curve of true positive rate vs. false positive rate at different classification thresholds . See also AUC .

root

#df

The starting node (the first condition ) in a decision tree . By convention, diagrams place the root at the top of the decision tree. For example:

A decision tree with two conditions and three leaves. The starting condition (x > 2) is the root.

root directory

#TensorFlow

The directory you specify for hosting subdirectories of the TensorFlow checkpoint and events files of multiple models.

Root Mean Squared Error (RMSE)

The square root of the Mean Squared Error .

rotational invariance

#image

In an image classification problem, an algorithm's ability to successfully classify images even when the orientation of the image changes. For example, the algorithm can still identify a tennis racket whether it is pointing up, sideways, or down. Note that rotational invariance is not always desirable; for example, an upside-down 9 should not be classified as a 9.

See also translational invariance and size invariance .

S

sampling bias

#fairness

See selection bias .

sampling with replacement

#df

A method of picking items from a set of candidate items in which the same item can be picked multiple times. The phrase "with replacement" means that after each selection, the selected item is returned to the pool of candidate items. The inverse method, sampling without replacement , means that a candidate item can be picked only once.

For example, consider the following fruit set:

fruit = {kiwi, apple, pear, fig, cherry, lime, mango}

Suppose that the system randomly picks fig as the first item. When sampling with replacement, the system picks the second item from the following set:

fruit = {kiwi, apple, pear, fig, cherry, lime, mango}

Yes, that's the same set as before, so the system could potentially pick fig again.

When sampling without replacement, once picked, a sample can't be picked again. For example, if the system randomly picks fig as the first sample, then fig can't be picked again. Therefore, the system picks the second sample from the following (reduced) set:

fruit = {kiwi, apple, pear, cherry, lime, mango}

SavedModel

#TensorFlow

The recommended format for saving and recovering TensorFlow models. SavedModel is a language-neutral, recoverable serialization format, which enables higher-level systems and tools to produce, consume, and transform TensorFlow models.

See the Saving and Restoring chapter in the TensorFlow Programmer's Guide for complete details.

Saver

#TensorFlow

A TensorFlow object responsible for saving model checkpoints.

scalar

A single number or a single string that can be represented as a tensor of rank 0. For example, the following lines of code each create one scalar in TensorFlow:

breed = tf.Variable("poodle", tf.string)
temperature = tf.Variable(27, tf.int16)
precision = tf.Variable(0.982375101275, tf.float64)

scaling

A commonly used practice in feature engineering to tame a feature's range of values to match the range of other features in the dataset. For example, suppose that you want all floating-point features in the dataset to have a range of 0 to 1. Given a particular feature's range of 0 to 500, you could scale that feature by dividing each value by 500.

See also normalization .

scikit-learn

A popular open-source machine learning platform. See scikit-learn.org .

scoring

#recsystems

The part of a recommendation system that provides a value or ranking for each item produced by the candidate generation phase.

selection bias

#fairness

Errors in conclusions drawn from sampled data due to a selection process that generates systematic differences between samples observed in the data and those not observed. The following forms of selection bias exist:

  • coverage bias : The population represented in the dataset does not match the population that the machine learning model is making predictions about.
  • sampling bias : Data is not collected randomly from the target group.
  • non-response bias (also called participation bias ): Users from certain groups opt-out of surveys at different rates than users from other groups.

For example, suppose you are creating a machine learning model that predicts people's enjoyment of a movie. To collect training data, you hand out a survey to everyone in the front row of a theater showing the movie. Offhand, this may sound like a reasonable way to gather a dataset; however, this form of data collection may introduce the following forms of selection bias:

  • coverage bias: By sampling from a population who chose to see the movie, your model's predictions may not generalize to people who did not already express that level of interest in the movie.
  • sampling bias: Rather than randomly sampling from the intended population (all the people at the movie), you sampled only the people in the front row. It is possible that the people sitting in the front row were more interested in the movie than those in other rows.
  • non-response bias: In general, people with strong opinions tend to respond to optional surveys more frequently than people with mild opinions. Since the movie survey is optional, the responses are more likely to form a bimodal distribution than a normal (bell-shaped) distribution.

self-attention (also called self-attention layer)

#language

A neural network layer that transforms a sequence of embeddings (for instance, token embeddings) into another sequence of embeddings. Each embedding in the output sequence is constructed by integrating information from the elements of the input sequence through an attention mechanism.

The self part of self-attention refers to the sequence attending to itself rather than to some other context. Self-attention is one of the main building blocks for Transformers and uses dictionary lookup terminology, such as “query”, “key”, and “value”.

A self-attention layer starts with a sequence of input representations, one for each word. The input representation for a word can be a simple embedding. For each word in an input sequence, the network scores the relevance of the word to every element in the whole sequence of words. The relevance scores determine how much the word's final representation incorporates the representations of other words.

For example, consider the following sentence:

The animal didn't cross the street because it was too tired.

The following illustration (from Transformer: A Novel Neural Network Architecture for Language Understanding ) shows a self-attention layer's attention pattern for the pronoun it , with the darkness of each line indicating how much each word contributes to the representation:

The following sentence appears twice: 'The animal didn't cross the
          street because it was too tired.'  Lines connect the word 'it' in
          one sentence to five tokens ('The', 'animal', 'street', 'it', and
          the period) in the other sentence.  The line between 'it' and
          'animal' is strongest.

The self-attention layer highlights words that are relevant to "it". In this case, the attention layer has learned to highlight words that it might refer to, assigning the highest weight to animal .

For a sequence of n tokens , self-attention transforms a sequence of embeddings n separate times, once at each position in the sequence.
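
A minimal single-head scaled dot-product self-attention sketch in NumPy; it omits the learned query, key, and value projections that a real Transformer layer applies, and the token embeddings here are random placeholders:

import numpy as np

def self_attention(embeddings):
    """Single-head scaled dot-product self-attention (no learned projections)."""
    d = embeddings.shape[-1]
    # Score the relevance of every token to every other token in the sequence.
    scores = embeddings @ embeddings.T / np.sqrt(d)
    # Softmax over each row turns scores into attention weights.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output embedding is a weighted mix of all input embeddings.
    return weights @ embeddings

tokens = np.random.default_rng(seed=0).normal(size=(4, 8))   # 4 tokens, 8-dim embeddings
print(self_attention(tokens).shape)                          # -> (4, 8)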

Refer also to attention and multi-head self-attention .

self-supervised learning

A family of techniques for converting an unsupervised machine learning problem into a supervised machine learning problem by creating surrogate labels from unlabeled examples .

Some Transformer -based models such as BERT use self-supervised learning.

Self-supervised training is a semi-supervised learning approach.

self-training

A variant of self-supervised learning that is particularly useful when all of the following conditions are true:

Self-training works by iterating over the following two steps until the model stops improving:

  1. Use supervised machine learning to train a model on the labeled examples.
  2. Use the model created in Step 1 to generate predictions (labels) on the unlabeled examples, moving those in which there is high confidence into the labeled examples with the predicted label.

Notice that each iteration of Step 2 adds more labeled examples for Step 1 to train on.

semi-supervised learning

Training a model on data where some of the training examples have labels but others don't. One technique for semi-supervised learning is to infer labels for the unlabeled examples, and then to train on the inferred labels to create a new model. Semi-supervised learning can be useful if labels are expensive to obtain but unlabeled examples are plentiful.

Self-training is one technique for semi-supervised learning.

sensitive attribute

#fairness
A human attribute that may be given special consideration for legal, ethical, social, or personal reasons.

sentiment analysis

#language

Using statistical or machine learning algorithms to determine a group's overall attitude—positive or negative—toward a service, product, organization, or topic. For example, using natural language understanding , an algorithm could perform sentiment analysis on the textual feedback from a university course to determine the degree to which students generally liked or disliked the course.

sequence model

#seq

A model whose inputs have a sequential dependence. For example, predicting the next video watched from a sequence of previously watched videos.

sequence-to-sequence task

#language

A task that converts an input sequence of tokens to an output sequence of tokens. For example, two popular kinds of sequence-to-sequence tasks are:

  • Translators:
    • Sample input sequence: "I love you."
    • Sample output sequence: "Je t'aime."
  • Question answering:
    • Sample input sequence: "Do I need my car in New York City?"
    • Sample output sequence: "No. Please keep your car at home."

serving

A synonym for inferring .

shape (Tensor)

The number of elements in each dimension of a tensor. The shape is represented as a list of integers. For example, the following two-dimensional tensor has a shape of [3,4]:

[[5, 7, 6, 4],
 [2, 9, 4, 8],
 [3, 6, 5, 1]]

TensorFlow uses row-major (C-style) format to represent the order of dimensions, which is why the shape in TensorFlow is [3,4] rather than [4,3]. In other words, in a two-dimensional TensorFlow Tensor, the shape is [ number of rows , number of columns ].

shrinkage

#df

A hyperparameter in gradient boosting that controls overfitting . Shrinkage in gradient boosting is analogous to the learning rate in gradient descent . Shrinkage is a decimal value between 0.0 and 1.0. A lower shrinkage value reduces overfitting more than a larger shrinkage value.

sigmoid function

A function that maps logistic or multinomial regression output (log odds) to probabilities, returning a value between 0 and 1. The sigmoid function has the following formula:

$$y = \frac{1}{1 + e^{-\sigma}}$$

where \(\sigma\) in logistic regression problems is simply:

$$\sigma = b + w_1x_1 + w_2x_2 + … w_nx_n$$

In other words, the sigmoid function converts \(\sigma\) into a probability between 0 and 1.

In some neural networks , the sigmoid function acts as the activation function .
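
The sigmoid formula expressed as a small Python function; the sample inputs are illustrative:

import math

def sigmoid(sigma):
    """Maps log odds (sigma) to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-sigma))

print(sigmoid(0.0), sigmoid(4.0), sigmoid(-4.0))   # -> 0.5, ~0.982, ~0.018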

similarity measure

#clustering

In clustering algorithms, the metric used to determine how alike (how similar) any two examples are.

size invariance

#image

In an image classification problem, an algorithm's ability to successfully classify images even when the size of the image changes. For example, the algorithm can still identify a cat whether it consumes 2M pixels or 200K pixels. Note that even the best image classification algorithms still have practical limits on size invariance. For example, an algorithm (or human) is unlikely to correctly classify a cat image consuming only 20 pixels.

See also translational invariance and rotational invariance .

sketching

#clustering

In unsupervised machine learning , a category of algorithms that perform a preliminary similarity analysis on examples. Sketching algorithms use a locality-sensitive hash function to identify points that are likely to be similar, and then group them into buckets.

Sketching decreases the computation required for similarity calculations on large datasets. Instead of calculating similarity for every single pair of examples in the dataset, we calculate similarity only for each pair of points within each bucket.

softmax

A function that provides probabilities for each possible class in a multi-class classification model . The probabilities add up to exactly 1.0. For example, softmax might determine that the probability of a particular image being a dog is 0.9, a cat is 0.08, and a horse is 0.02. (Also called full softmax .)
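
A minimal, numerically stable softmax sketch; the logits are hypothetical values chosen to roughly reproduce the dog/cat/horse probabilities above:

import numpy as np

def softmax(logits):
    """Converts raw scores into probabilities that add up to 1.0."""
    shifted = np.asarray(logits, dtype=np.float64) - np.max(logits)   # numerical stability
    exps = np.exp(shifted)
    return exps / exps.sum()

probs = softmax([4.0, 1.6, 0.2])   # hypothetical logits for dog, cat, horse
print(probs.round(2))              # -> approximately [0.9 0.08 0.02], summing to 1.0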

Contrast with candidate sampling .

sparse feature

Feature vector whose values are predominately zero or empty. For example, a vector containing a single 1 value and a million 0 values is sparse. As another example, words in a search query could also be a sparse feature—there are many possible words in a given language, but only a few of them occur in a given query.

Contrast with dense feature .

sparse representation

A representation of a tensor that only stores nonzero elements.

For example, the English language consists of about a million words. Consider two ways to represent a count of the words used in one English sentence:

  • A dense representation of this sentence must set an integer for all one million cells, placing a 0 in most of them, and a low integer into a few of them.
  • A sparse representation of this sentence stores only those cells symbolizing a word actually in the sentence. So, if the sentence contained only 20 unique words, then the sparse representation for the sentence would store an integer in only 20 cells.

For example, consider two ways to represent the sentence, "Dogs wag tails." As the following tables show, the dense representation consumes about a million cells; the sparse representation consumes only 3 cells:

Dense Representation
Cell Number Word Occurrence
0 a 0
1 aardvark 0
2 aargh 0
3 aarti 0
… 140,391 more words with an occurrence of 0
140395 dogs 1
… 633,062 words with an occurrence of 0
773458 tails 1
… 189,135 words with an occurrence of 0
962594 wag 1
… many more words with an occurrence of 0
Sparse Representation
Cell Number Word Occurrence
140395 dogs 1
773458 tails 1
962594 wag 1
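
A minimal sketch that stores only the nonzero cells of the sparse representation above, keyed by cell number:

# Dense: an array with ~1,000,000 cells, almost all of them holding 0.
# Sparse: store only the nonzero cells, keyed by cell number (vocabulary index).
sparse_counts = {
    140395: 1,   # "dogs"
    773458: 1,   # "tails"
    962594: 1,   # "wag"
}

def occurrence(cell_number):
    return sparse_counts.get(cell_number, 0)   # every other word has a count of 0

print(occurrence(140395), occurrence(1))   # -> 1 0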

sparse vector

A vector whose values are mostly zeroes. See also sparse feature .

sparsity

The number of elements set to zero (or null) in a vector or matrix divided by the total number of entries in that vector or matrix. For example, consider a 10x10 matrix in which 98 cells contain zero. The calculation of sparsity is as follows:

$$ {\text{sparsity}} = \frac{\text{98}} {\text{100}} = {\text{0.98}} $$
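
A minimal NumPy sketch reproducing that calculation for a 10x10 matrix with 98 zero cells:

import numpy as np

matrix = np.zeros((10, 10))
matrix[0, 0] = 7.0
matrix[5, 3] = 2.0          # only 2 of the 100 cells are nonzero

sparsity = np.count_nonzero(matrix == 0) / matrix.size
print(sparsity)             # -> 0.98, matching the calculation above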

Feature sparsity refers to the sparsity of a feature vector; model sparsity refers to the sparsity of the model weights.

spatial pooling

#image

See pooling .

split

#df

In a decision tree , another name for a condition .

splitter

#df

While training a decision tree , the routine (and algorithm) responsible for finding the best condition at each node .

squared hinge loss

The square of the hinge loss . Squared hinge loss penalizes outliers more harshly than regular hinge loss.

squared loss

The loss function used in linear regression . (Also known as L 2 Loss .) This function calculates the squares of the difference between a model's predicted value for a labeled example and the actual value of the label . Due to squaring, this loss function amplifies the influence of bad predictions. That is, squared loss reacts more strongly to outliers than L 1 loss .

staged training

#language

A tactic of training a model in a sequence of discrete stages. The goal can be either to speed up the training process, or to achieve better model quality.

An illustration of the progressive stacking approach is shown below:

  • Stage 1 contains 3 hidden layers, stage 2 contains 6 hidden layers, and stage 3 contains 12 hidden layers.
  • Stage 2 begins training with the weights learned in the 3 hidden layers of Stage 1. Stage 3 begins training with the weights learned in the 6 hidden layers of Stage 2.

Three stages, which are labeled 'Stage 1', 'Stage 2', and 'Stage 3'.
          Each stage contains a different number of layers: Stage 1 contains
          3 layers, Stage 2 contains 6 layers, and Stage 3 contains 12 layers.
          The 3 layers from Stage 1 become the first 3 layers of Stage 2.
          Similarly, the 6 layers from Stage 2 become the first 6 layers of
          Stage 3.

See also pipelining .

state

#rl

In reinforcement learning, the parameter values that describe the current configuration of the environment, which the agent uses to choose an action .

state-action value function

#rl

Synonym for Q-function .

static model

A model that is trained offline.

stationarity

A property of data in a dataset, in which the data distribution stays constant across one or more dimensions. Most commonly, that dimension is time, meaning that data exhibiting stationarity doesn't change over time. For example, data that exhibits stationarity doesn't change from September to December.

step

A forward and backward evaluation of one batch .

step size

Synonym for learning rate .

stochastic gradient descent (SGD)

A gradient descent algorithm in which the batch size is one. In other words, SGD relies on a single example chosen uniformly at random from a dataset to calculate an estimate of the gradient at each step.

stride

#image

In a convolutional operation or pooling, the delta in each dimension of the next series of input slices. For example, the following animation demonstrates a (1,1) stride during a convolutional operation. Therefore, the next input slice starts one position to the right of the previous input slice. When the operation reaches the right edge, the next slice is all the way over to the left but one position down.

An input 5x5 matrix and a 3x3 convolutional filter. Because the
     stride is (1,1), a convolutional filter will be applied 9 times. The first
     convolutional slice evaluates the top-left 3x3 submatrix of the input
     matrix. The second slice evaluates the top-middle 3x3
     submatrix. The third convolutional slice evaluates the top-right 3x3
     submatrix.  The fourth slice evaluates the middle-left 3x3 submatrix.
     The fifth slice evaluates the middle 3x3 submatrix. The sixth slice
     evaluates the middle-right 3x3 submatrix. The seventh slice evaluates
     the bottom-left 3x3 submatrix.  The eighth slice evaluates the
     bottom-middle 3x3 submatrix. The ninth slice evaluates the bottom-right 3x3
     submatrix.

The preceding example demonstrates a two-dimensional stride. If the input matrix is three-dimensional, the stride would also be three-dimensional.

structural risk minimization (SRM)

An algorithm that balances two goals:

  • The desire to build the most predictive model (for example, lowest loss).
  • The desire to keep the model as simple as possible (for example, strong regularization).

For example, a function that minimizes loss+regularization on the training set is a structural risk minimization algorithm.

Contrast with empirical risk minimization .

subsampling

#image

See pooling .

summary

#TensorFlow

In TensorFlow, a value or set of values calculated at a particular step , usually used for tracking model metrics during training.

supervised machine learning

Training a model from input data and its corresponding labels . Supervised machine learning is analogous to a student learning a subject by studying a set of questions and their corresponding answers. After mastering the mapping between questions and answers, the student can then provide answers to new (never-before-seen) questions on the same topic. Compare with unsupervised machine learning .

synthetic feature

A feature not present among the input features, but created from one or more of them. Kinds of synthetic features include:

  • Bucketing a continuous feature into range bins.
  • Multiplying (or dividing) one feature value by other feature value(s) or by itself.
  • Creating a feature cross .

Features created by normalizing or scaling alone are not considered synthetic features.

T

tabular Q-learning

#rl

In reinforcement learning, implementing Q-learning by using a table to store the Q-functions for every combination of state and action .
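
A minimal sketch of a single tabular Q-learning update; the environment size, learning rate, and discount factor are hypothetical placeholders:

import numpy as np

n_states, n_actions = 5, 2            # hypothetical tiny environment
q_table = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.9               # learning rate and discount factor

def update(state, action, reward, next_state):
    """One tabular Q-learning step toward reward + gamma * max_a' Q(next_state, a')."""
    target = reward + gamma * q_table[next_state].max()
    q_table[state, action] += alpha * (target - q_table[state, action])

update(state=0, action=1, reward=1.0, next_state=2)
print(q_table[0, 1])                  # -> 0.1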

target

Synonym for label .

target network

#rl

In Deep Q-learning , a neural network that is a stable approximation of the main neural network, where the main neural network implements either a Q-function or a policy . Then, you can train the main network on the Q-values predicted by the target network. Therefore, you prevent the feedback loop that occurs when the main network trains on Q-values predicted by itself. By avoiding this feedback, training stability increases.

temporal data

Data recorded at different points in time. For example, winter coat sales recorded for each day of the year would be temporal data.

Tensor

#TensorFlow

The primary data structure in TensorFlow programs. Tensors are N-dimensional (where N could be very large) data structures, most commonly scalars, vectors, or matrices. The elements of a Tensor can hold integer, floating-point, or string values.

TensorBoard

#TensorFlow

The dashboard that displays the summaries saved during the execution of one or more TensorFlow programs.

TensorFlow

#TensorFlow

A large-scale, distributed, machine learning platform. The term also refers to the base API layer in the TensorFlow stack, which supports general computation on dataflow graphs.

Although TensorFlow is primarily used for machine learning, you may also use TensorFlow for non-ML tasks that require numerical computation using dataflow graphs.

TensorFlow Playground

#TensorFlow

A program that visualizes how different hyperparameters influence model (primarily neural network) training. Go to http://playground.tensorflow.org to experiment with TensorFlow Playground.

TensorFlow Serving

#TensorFlow

A platform to deploy trained models in production.

Tensor Processing Unit (TPU)

#TensorFlow
#GoogleCloud

An application-specific integrated circuit (ASIC) that optimizes the performance of machine learning workloads. These ASICs are deployed as multiple TPU chips on a TPU device .

Tensor rank

#TensorFlow

See rank (Tensor) .

Tensor shape

#TensorFlow

The number of elements a Tensor contains in various dimensions. For example, a [5, 10] Tensor has a shape of 5 in one dimension and 10 in another.

Tensor size

#TensorFlow

The total number of scalars a Tensor contains. For example, a [5, 10] Tensor has a size of 50.

termination condition

#rl

In reinforcement learning, the conditions that determine when an episode ends, such as when the agent reaches a certain state or exceeds a threshold number of state transitions. For example, in tic-tac-toe (also known as noughts and crosses), an episode terminates either when a player marks three consecutive spaces or when all spaces are marked.

test

#df

In a decision tree , another name for a condition .

test set

The subset of the dataset that you use to test your model after the model has gone through initial vetting by the validation set.

Contrast with training set and validation set .

tf.Example

#TensorFlow

A standard protocol buffer for describing input data for machine learning model training or inference.

tf.keras

#TensorFlow

An implementation of Keras integrated into TensorFlow .

threshold (for decision trees)

#df

In an axis-aligned condition , the value that a feature is being compared against. For example, 75 is the threshold value in the following condition:

grade >= 75

time series analysis

#clustering

A subfield of machine learning and statistics that analyzes temporal data . Many types of machine learning problems require time series analysis, including classification, clustering, forecasting, and anomaly detection. For example, you could use time series analysis to forecast the future sales of winter coats by month based on historical sales data.

timestep

#seq

One "unrolled" cell within a recurrent neural network . For example, the following figure shows three timesteps (labeled with the subscripts t-1, t, and t+1):

Three timesteps in a recurrent neural network. The output of the
          first timestep becomes input to the second timestep. The output
          of the second timestep becomes input to the third timestep.

token

#language

In a language model , the atomic unit that the model is training on and making predictions on. A token is typically one of the following:

  • a word—for example, the phrase "dogs like cats" consists of three word tokens: "dogs", "like", and "cats".
  • a character—for example, the phrase "bike fish" consists of nine character tokens. (Note that the blank space counts as one of the tokens.)
  • subwords—in which a single word can be a single token or multiple tokens. A subword consists of a root word, a prefix, or a suffix. For example, a language model that uses subwords as tokens might view the word "dogs" as two tokens (the root word "dog" and the plural suffix "s"). That same language model might view the single word "taller" as two subwords (the root word "tall" and the suffix "er").

In domains outside of language models, tokens can represent other kinds of atomic units. For example, in computer vision, a token might be a subset of an image.

tower

A component of a deep neural network that is itself a deep neural network without an output layer. Typically, each tower reads from an independent data source. Towers are independent until their output is combined in a final layer.

TPU

#TensorFlow
#GoogleCloud

Abbreviation for Tensor Processing Unit .

TPU chip

#TensorFlow
#GoogleCloud

A programmable linear algebra accelerator with on-chip high-bandwidth memory that is optimized for machine learning workloads. Multiple TPU chips are deployed on a TPU device .

TPU device

#TensorFlow
#GoogleCloud

A printed circuit board (PCB) with multiple TPU chips , high-bandwidth network interfaces, and system cooling hardware.

TPU master

#TensorFlow
#GoogleCloud

The central coordination process running on a host machine that sends and receives data, results, programs, and performance and system health information to the TPU workers . The TPU master also manages the setup and shutdown of TPU devices .

TPU node

#TensorFlow
#GoogleCloud

A TPU resource on Google Cloud Platform with a specific TPU type . The TPU node connects to your VPC network from a peer VPC network. TPU nodes are a resource defined in the Cloud TPU API .

TPU Pod

#TensorFlow
#GoogleCloud

A specific configuration of TPU devices in a Google data center. All of the devices in a TPU Pod are connected to one another over a dedicated high-speed network. A TPU Pod is the largest configuration of TPU devices available for a specific TPU version.

TPU resource

#TensorFlow
#GoogleCloud

A TPU entity on Google Cloud Platform that you create, manage, or consume. For example, TPU nodes and TPU types are TPU resources.

TPU slice

#TensorFlow
#GoogleCloud

A TPU slice is a fractional portion of the TPU devices in a TPU Pod . All of the devices in a TPU slice are connected to one another over a dedicated high-speed network.

TPU type

#TensorFlow
#GoogleCloud

A configuration of one or more TPU devices with a specific TPU hardware version. You select a TPU type when you create a TPU node on Google Cloud Platform. For example, a v2-8 TPU type is a single TPU v2 device with 8 cores. A v3-2048 TPU type has 256 networked TPU v3 devices and a total of 2048 cores. TPU types are a resource defined in the Cloud TPU API .

TPU worker

#TensorFlow
#GoogleCloud

A process that runs on a host machine and executes machine learning programs on TPU devices .

training

The process of determining the ideal parameters comprising a model.

training-serving skew

The difference between a model's performance during training and that same model's performance during serving .

training set

The subset of the dataset used to train a model.

Contrast with validation set and test set .

trajectory

#rl

In reinforcement learning, a sequence of tuples that represent a sequence of state transitions of the agent , where each tuple corresponds to the state, action , reward , and next state for a given state transition.

transfer learning

Transferring information from one machine learning task to another. For example, in multi-task learning, a single model solves multiple tasks, such as a deep model that has different output nodes for different tasks. Transfer learning might involve transferring knowledge from the solution of a simpler task to a more complex one, or involve transferring knowledge from a task where there is more data to one where there is less data.

Most machine learning systems solve a single task. Transfer learning is a baby step towards artificial intelligence in which a single program can solve multiple tasks.

Transformer

#language

A neural network architecture developed at Google that relies on self-attention mechanisms to transform a sequence of input embeddings into a sequence of output embeddings without relying on convolutions or recurrent neural networks . A Transformer can be viewed as a stack of self-attention layers.

A Transformer can include any of the following:

An encoder transforms a sequence of embeddings into a new sequence of the same length. An encoder includes N identical layers, each of which contains two sub-layers. These two sub-layers are applied at each position of the input embedding sequence, transforming each element of the sequence into a new embedding. The first encoder sub-layer aggregates information from across the input sequence. The second encoder sub-layer transforms the aggregated information into an output embedding.

A decoder transforms a sequence of input embeddings into a sequence of output embeddings, possibly with a different length. A decoder also includes N identical layers with three sub-layers, two of which are similar to the encoder sub-layers. The third decoder sub-layer takes the output of the encoder and applies the self-attention mechanism to gather information from it.

The blog post Transformer: A Novel Neural Network Architecture for Language Understanding provides a good introduction to Transformers.

translational invariance

#image

In an image classification problem, an algorithm's ability to successfully classify images even when the position of objects within the image changes. For example, the algorithm can still identify a dog, whether it is in the center of the frame or at the left end of the frame.

See also size invariance and rotational invariance .

trigram

#seq
#language

An N-gram in which N=3.

true negative (TN)

An example in which the model correctly predicted the negative class . For example, the model inferred that a particular email message was not spam, and that email message really was not spam.

true positive (TP)

An example in which the model correctly predicted the positive class . For example, the model inferred that a particular email message was spam, and that email message really was spam.

true positive rate (TPR)

Synonym for recall . That is:

$$\text{True Positive Rate} = \frac{\text{True Positives}} {\text{True Positives} + \text{False Negatives}}$$

True positive rate is the y-axis in an ROC curve .

U

unawareness (to a sensitive attribute)

#fairness

A situation in which sensitive attributes are present, but not included in the training data. Because sensitive attributes are often correlated with other attributes of one's data, a model trained with unawareness about a sensitive attribute could still have disparate impact with respect to that attribute, or violate other fairness constraints .

underfitting

Producing a model with poor predictive ability because the model hasn't captured the complexity of the training data. Many problems can cause underfitting, including:

  • Training on the wrong set of features.
  • Training for too few epochs or at too low a learning rate.
  • Training with too high a regularization rate.
  • Providing too few hidden layers in a deep neural network.

undersampling

Removing examples from the majority class in a class-imbalanced dataset in order to create a more balanced training set .

For example, consider a dataset in which the ratio of the majority class to the minority class is 20:1. To overcome this class imbalance, you could create a training set consisting of all of the minority class examples but only a tenth of the majority class examples, which would create a training-set class ratio of 2:1. Thanks to undersampling, this more balanced training set might produce a better model. Alternatively, this more balanced training set might contain insufficient examples to train an effective model.

Contrast with oversampling .

unidirectional

#language

A system that only evaluates the text that precedes a target section of text. In contrast, a bidirectional system evaluates both the text that precedes and follows a target section of text. See bidirectional for more details.

unidirectional language model

#language

A language model that bases its probabilities only on the tokens appearing before , not after , the target token(s). Contrast with bidirectional language model .

unlabeled example

An example that contains features but no label . Unlabeled examples are the input to inference . In semi-supervised and unsupervised learning, unlabeled examples are used during training.

unsupervised machine learning

#clustering

Training a model to find patterns in a dataset, typically an unlabeled dataset.

The most common use of unsupervised machine learning is to cluster data into groups of similar examples. For example, an unsupervised machine learning algorithm can cluster songs together based on various properties of the music. The resulting clusters can become an input to other machine learning algorithms (for example, to a music recommendation service). Clustering can be helpful in domains where true labels are hard to obtain. For example, in domains such as anti-abuse and fraud, clusters can help humans better understand the data.

Another example of unsupervised machine learning is principal component analysis (PCA) . For example, applying PCA on a dataset containing the contents of millions of shopping carts might reveal that shopping carts containing lemons frequently also contain antacids.

Compare with supervised machine learning .

uplift modeling

A modeling technique, commonly used in marketing, that models the "causal effect" (also known as the "incremental impact") of a "treatment" on an "individual." Here are two examples:

  • Doctors might use uplift modeling to predict the mortality decrease (causal effect) of a medical procedure (treatment) depending on the age and medical history of a patient (individual).
  • Marketers might use uplift modeling to predict the increase in probability of a purchase (causal effect) due to an advertisement (treatment) on a person (individual).

Uplift modeling differs from classification or regression in that some labels (for example, half of the labels in binary treatments) are always missing in uplift modeling. For example, a patient can either receive or not receive a treatment; therefore, we can observe whether the patient heals in only one of these two situations (but never both). The main advantage of an uplift model is that it can generate predictions for the unobserved situation (the counterfactual) and use it to compute the causal effect.

upweighting

Applying a weight to the downsampled class equal to the factor by which you downsampled.
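For example, if the majority class was downsampled by a factor of 10, a Keras training run might upweight it by the same factor via `class_weight`. This is a minimal sketch with invented toy data and an assumed label encoding (0 = the downsampled majority class).

```python
import numpy as np
import tensorflow as tf

# Toy (already downsampled) training set; label 0 is the downsampled majority class.
rng = np.random.default_rng(0)
x_train = rng.normal(size=(300, 4)).astype("float32")
y_train = rng.integers(0, 2, size=300)

# Upweight the downsampled class by the same factor (10) used for downsampling.
class_weight = {0: 10.0, 1: 1.0}

model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
model.fit(x_train, y_train, epochs=2, class_weight=class_weight, verbose=0)
```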

user matrix

#recsystems

In recommendation systems , an embedding generated by matrix factorization that holds latent signals about user preferences. Each row of the user matrix holds information about the relative strength of various latent signals for a single user. For example, consider a movie recommendation system. In this system, the latent signals in the user matrix might represent each user's interest in particular genres, or might be harder-to-interpret signals that involve complex interactions across multiple factors.

The user matrix has a column for each latent feature and a row for each user. That is, the user matrix has the same number of rows as the target matrix that is being factorized. For example, given a movie recommendation system for 1,000,000 users, the user matrix will have 1,000,000 rows.
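The following minimal NumPy sketch illustrates the shapes involved; it uses a truncated SVD as a stand-in for the factorization (real recommendation systems typically use WALS or stochastic gradient descent), and the ratings matrix is invented.

```python
import numpy as np

# Toy ratings matrix: 6 users x 4 movies (0 means "not rated" in this sketch).
ratings = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [1, 0, 0, 4],
    [0, 1, 5, 4],
    [2, 1, 3, 0],
], dtype=float)

num_latent = 2  # number of latent features chosen for the factorization

# Factorize with a truncated SVD purely to illustrate the resulting shapes.
u, s, vt = np.linalg.svd(ratings, full_matrices=False)
user_matrix = u[:, :num_latent] * s[:num_latent]  # (6 users, 2 latent features)
item_matrix = vt[:num_latent, :]                  # (2 latent features, 4 movies)

print(user_matrix.shape)  # (6, 2): one row per user, one column per latent feature
```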

V

validation

A process used, as part of training , to evaluate the quality of a machine learning model using the validation set . Because the validation set is disjoint from the training set, validation helps ensure that the model's performance generalizes beyond the training set.

Contrast with test set .

validation set

A subset of the dataset—disjoint from the training set—used in validation .

Contrast with training set and test set .
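As a rough sketch, a dataset might be partitioned into disjoint training, validation, and test sets as follows; the 80/10/10 split and the toy data are only illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy dataset of 1,000 examples.
X = np.arange(1000).reshape(-1, 1)
y = np.arange(1000)

# First carve off a disjoint test set, then split the remainder into
# training and validation sets.
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.1, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=1 / 9, random_state=0)  # 10% of the original data

print(len(X_train), len(X_val), len(X_test))  # 800 100 100
```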

vanishing gradient problem

#seq

The tendency for the gradients of early hidden layers of some deep neural networks to become surprisingly flat (low). Increasingly lower gradients result in increasingly smaller changes to the weights on nodes in a deep neural network, leading to little or no learning. Models suffering from the vanishing gradient problem become difficult or impossible to train. Long Short-Term Memory cells address this issue.

Compare to exploding gradient problem .

variable importances

#df

A set of scores that indicates the relative importance of each feature to the model.

For example, consider a decision tree estimating house prices. Suppose this decision tree uses three features: size, age, and style. If the set of variable importances for the three features is calculated as {size=5.8, age=2.5, style=4.7}, then size is more important to the decision tree than age or style.

Different variable importance metrics exist, which can inform ML experts about different aspects of models.
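For instance, one common metric is the impurity-based importance exposed by scikit-learn decision trees. This is a minimal sketch with invented house-price data; its scores are on a different scale than the {size=5.8, age=2.5, style=4.7} example above.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy house-price data with three features: size, age, style (encoded numerically).
rng = np.random.default_rng(0)
size = rng.uniform(50, 300, 500)
age = rng.uniform(0, 100, 500)
style = rng.integers(0, 3, 500)
price = 1000 * size - 200 * age + 5000 * style + rng.normal(0, 1e4, 500)

X = np.column_stack([size, age, style])
tree = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X, price)

# Impurity-based feature importances, one score per feature.
for name, score in zip(["size", "age", "style"], tree.feature_importances_):
    print(f"{name}: {score:.2f}")
```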

W

Wasserstein loss

One of the loss functions commonly used in generative adversarial networks , based on the earth mover's distance between the distribution of generated data and real data.

weight

A coefficient for a feature in a linear model, or an edge in a deep network. The goal of training a linear model is to determine the ideal weight for each feature. If a weight is 0, then its corresponding feature does not contribute to the model.

Weighted Alternating Least Squares (WALS)

#recsystems

An algorithm for minimizing the objective function during matrix factorization in recommendation systems , which allows a downweighting of the missing examples. WALS minimizes the weighted squared error between the original matrix and the reconstruction by alternating between fixing the row factorization and the column factorization. Each of these optimizations can be solved by least squares convex optimization. For details, see the Recommendation Systems course.
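The alternation itself can be sketched as follows; this simplified example uses an unweighted, fully observed toy matrix and plain regularized least squares, so it omits the per-entry weighting that distinguishes WALS.

```python
import numpy as np

# Toy ratings matrix (rows = users, columns = items).
A = np.array([[5, 3, 1, 1],
              [4, 2, 1, 1],
              [1, 1, 5, 4],
              [1, 1, 4, 5]], dtype=float)
k, lam = 2, 0.1                       # latent dimension and L2 regularization
rng = np.random.default_rng(0)
U = rng.normal(size=(A.shape[0], k))  # row (user) factors
V = rng.normal(size=(A.shape[1], k))  # column (item) factors

for _ in range(20):
    # Fix V, solve a regularized least-squares problem for U.
    U = np.linalg.solve(V.T @ V + lam * np.eye(k), V.T @ A.T).T
    # Fix U, solve a regularized least-squares problem for V.
    V = np.linalg.solve(U.T @ U + lam * np.eye(k), U.T @ A).T

print(np.round(U @ V.T, 1))  # rank-2 reconstruction; roughly approximates A
```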

wide model

A linear model that typically has many sparse input features . We refer to it as "wide" since such a model is a special type of neural network with a large number of inputs that connect directly to the output node. Wide models are often easier to debug and inspect than deep models. Although wide models cannot express nonlinearities through hidden layers , they can use transformations such as feature crossing and bucketization to model nonlinearities in different ways.

Contrast with deep model .
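For example, one way to bucketize two numeric features and cross the resulting buckets before feeding them to a linear model is sketched below; the latitude/longitude features and bucket boundaries are invented.

```python
import numpy as np

# Toy latitude/longitude features; the bucket boundaries are made up.
rng = np.random.default_rng(0)
latitude = rng.uniform(32.0, 42.0, size=1000)
longitude = rng.uniform(-124.0, -114.0, size=1000)

lat_buckets = np.digitize(latitude, bins=np.linspace(32, 42, 11))
lon_buckets = np.digitize(longitude, bins=np.linspace(-124, -114, 11))

# Feature cross: one categorical value per (lat bucket, lon bucket) pair.
crossed = lat_buckets * 100 + lon_buckets

# One-hot encode the crossed feature; a linear (wide) model trained on these
# sparse inputs can capture interactions between latitude and longitude.
num_cells = crossed.max() + 1
one_hot = np.eye(num_cells)[crossed]
print(one_hot.shape)
```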

width

The number of neurons in a particular layer of a neural network .

wisdom of the crowd

#df

The idea that averaging the opinions or estimates of a large group of people ("the crowd") often produces surprisingly good results. For example, consider a game in which people guess the number of jelly beans packed into a large jar. Although most individual guesses will be inaccurate, the average of all the guesses has been empirically shown to be surprisingly close to the actual number of jelly beans in the jar.

Ensembles are a software analog of the wisdom of the crowd. Even if individual models make wildly inaccurate predictions, averaging the predictions of many models often generates surprisingly good predictions. For example, although an individual decision tree might make poor predictions, a decision forest often makes very good predictions.

word embedding

#language

Representing each word in a word set within an embedding ; that is, representing each word as a vector of floating-point values between 0.0 and 1.0. Words with similar meanings have more-similar representations than words with different meanings. For example, carrots , celery , and cucumbers would all have relatively similar representations, which would be very different from the representations of airplane , sunglasses , and toothpaste .
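The "similar meanings have similar representations" property is often measured with cosine similarity; the tiny hand-written vectors below are invented stand-ins for learned embeddings, which are typically much higher dimensional.

```python
import numpy as np

# Toy, hand-written 3-dimensional "embeddings" for illustration only.
embedding = {
    "carrot":   np.array([0.9, 0.8, 0.1]),
    "celery":   np.array([0.8, 0.9, 0.2]),
    "airplane": np.array([0.1, 0.2, 0.9]),
}

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(embedding["carrot"], embedding["celery"]))    # high
print(cosine_similarity(embedding["carrot"], embedding["airplane"]))  # lower
```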

Z

Z-score normalization

A normalization technique that replaces a raw feature value with a floating-point value representing the number of standard deviations from that feature's mean. For example, consider a feature whose mean is 800 and whose standard deviation is 100. The following table shows how Z-score normalization would map the raw value to its Z-score:

Raw value    Z-score
800          0
950          +1.5
575          -2.25

The machine learning model then trains on the Z-scores for that feature instead of on the raw values.
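A minimal sketch of that mapping (mean 800, standard deviation 100):

```python
import numpy as np

raw_values = np.array([800.0, 950.0, 575.0])
mean, std = 800.0, 100.0  # the feature's mean and standard deviation

z_scores = (raw_values - mean) / std
print(z_scores)  # [ 0.    1.5  -2.25]
```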