MySQL のセミジョイン最適化の各戦略について調べてみた - エス・エム・エスエンジニアテックブログ

この記事は、「MySQL Advent Calendar 2022」の13日目の記事です。

qiita.com

株式会社エス・エム・エスでエンジニアをしている@koma_koma_dです。今回はMySQLにおけるセミジョイン最適化について調べた内容を書きます。

※記載内容に誤りなどがある場合は筆者のTwitter宛に連絡をいただけると幸いです。

前置き

この記事で書くこと

この記事では、MySQLにおけるセミジョイン最適化について、サンプルテーブルを用いた実行例を示しながら、

セミジョイン最適化とは何か
セミジョイン最適化はなぜ有用か
セミジョイン最適化にはどのような種類があるのか
どのようにクエリやテーブル定義を変えると戦略が変化するか
どのような実行計画になるか
どのようなオプティマイザトレースになるか

などを紹介します。

執筆にあたって、Web上のリソースなどをある程度調べましたが、自分が気になった上記のような内容を網羅しているものが見当たらなかったので、誰かの役に立つかもと思って書いています。

この記事で書かないこと

MySQLの公式ドキュメントを読むだけでわかること

公式ドキュメントを読むだけでわかることについては、この記事では書きません（必要に応じて公式ドキュメントから引用をする場合はあります）。

コードリーディングに基づいた知見

MySQLのコードはGitHubで公開されているので、コードを読むことで動作を読み解くことも可能ではありますが、筆者にはその技量まではないので今回は対象外です。

GitHub - mysql/mysql-server: MySQL Server, the world's most popular open source database, and MySQL Cluster, a real-time, open source transactional database.

MySQLのバージョン間の比較

MySQLはセミジョイン最適化が導入された後も進歩を続けており、その過程でセミジョイン最適化に関連するバージョンアップも行われているようですが、それらについて細かく言及することは今回の記事ではしません。

他のRDBMSとの比較

Oracleなどの他のRDBMSにも類似した最適化があるようですが、それらとの比較は今回は対象外とします。

セミジョイン最適化概説

前置きが長くなりましたが、本題に入っていきます。今回のテーマであるセミジョイン最適化は、MySQL 5.6 からサブクエリの実行に関する最適化として導入されたものです。まず、なぜセミジョイン最適化が「最適化」たりうるのか、パフォーマンス的に嬉しいのかを説明します。

なぜセミジョイン最適化がパフォーマンス的に嬉しいのか？

本来、サブクエリはメインクエリに従属しており、サブクエリ側からはメインクエリ側のカラムを参照できます（相関サブクエリはこれを利用したもの）。サブクエリ側からメインクエリ側のカラムを参照できるということは、メインクエリ側のテーブルが先にアクセスされるということです。

しかし、セミジョイン最適化が行われると、結合（JOIN）として処理することになるので、①先にサブクエリ側のテーブルにアクセスすることが可能になります。このため、サブクエリ側のテーブルに先にアクセスした方が効率が良い場合には、パフォーマンス上のメリットを享受できる可能性があります。『詳解MySQL 5.7』では以下のように記載されています。

MySQL 5.6 では、 IN サブクエリを SEMIJOIN という特殊な JOIN へと変換することで、より効率的な実行計画が選択されるようになった。SEMIJOINとは、駆動表の1行に対して内部表からマッチする行が1行だけになるという特殊な結果を産むJOINである。

SEMIJOINがなぜunique_subqueryやindex_subqueryより優れているかというと、テーブルをJOINする順序を入れ替えられるからである。サブクエリ内でアクセスされるテーブルを先にアクセスするような実行計画のほうが効率的なものになるケースは少なくない。

（『詳解MySQL 5.7』p.110）

www.shoeisha.co.jp

更に、セミジョイン最適化を適用したクエリは、通常の結合では必ずしも満たされていない条件である、

最終的な結果にサブクエリ側のカラムが含まれない
メインクエリ側の1行に対してサブクエリ側の複数行をマッチさせない（マッチすることを考慮しなくてよい）

という条件を満たすため、②通常の結合では取ることのできない処理の効率化を行うことができます。（MySQLに焦点を当てた記述ではありませんが）『SQL実践入門』では以下のように記載されています。

「Semi-Join」は日本語では「準結合」または「半結合」と呼ばれています。これは通常の結合の際には現れない、EXISTS述語（とIN述語）を使ったときに特有のアルゴリズムです。このアルゴリズムの特徴は次の2つです。

機能的には、結果には駆動表となるテーブルのデータしか含まれず、しかも1行につき必ず1行しか結果が生成されない（通常の結合の場合、1対Nの結合の場合は行数が増えることがある）

内部表にマッチする行を1行でも発見した時点で残りの行の検索を打ち切れるため、通常の結合よりもパフォーマンスが良い

（『SQL実践入門』p.341）

gihyo.jp

以上で記載した、

①先にサブクエリ側のテーブルにアクセスすることが可能
②通常の結合では取ることのできない処理の効率化を行うことができる

という2つの特性をどのように活かすかが、次に紹介するそれぞれのセミジョインの戦略で違ってきます。

セミジョインにはどのような「戦略」があるのか？

セミジョインには、いくつかの種類があります。MySQLではそれらを「戦略（Strategy）」と表現しています。各戦略の特徴は、以下のように整理できます。

なお、表中の「重複の除去」は、先述の「メインクエリ側の1行に対してサブクエリ側の複数行をマッチさせない（マッチすることを考慮しなくてよい）」という側面に関するもので、メインクエリ側の1行がサブクエリ結果との結合によって最終的な結果の中で重複しない理由、重複させない方法を記載しています。

戦略の名称	サブクエリ側の結合キー列のINDEXの必要性	駆動表と内部表	重複の除去
Table Pullout	UNIQUE制約必要	可変	UNIQUE制約により保証
LooseScan	必要	サブクエリ側が駆動表	インデックスを活用しながら結合時に実施
Materialization	不要（一時表で自動作成）	可変	サブクエリ実体化時に除去
Duplicate Weedout	不要	可変	結果を返す前に除去
FirstMatch	不要	メインクエリ側が駆動表	結合時に除去

以下、個別に補足説明を加えます。記載している内容は参考資料に依拠しているほか、サンプルテーブルを使った実行例から分かる内容を記載しています。

主な参考資料は以下の2つです。

MySQL道普請便り第43回 MySQLの準結合（セミジョイン）について
- セミジョインについての親切な紹介。駆動表と内部表がどうなるかはこちらに主に依拠した。
MariaDB 10.6 [日本語] 最適化とチューニングセミジョイン副問い合わせの最適化
- （MySQLではなくMariaDBですが）それぞれの戦略が図付きで解説されていてわかりやすい。

各戦略の特徴

ここからは、上で記載した表の内容を戦略ごとに補足していきます。各戦略について細かく論じていくにあたって、サンプルテーブルを用いて実際に実行計画やオプティマイザトレースを取得した結果を随時示します。実行計画やオプティマイザトレースについては、実物を示すのが最も参考になると思いましたので、厚め（実行計画は全部、オプティマイザトレースは一部抜粋）に載せましたが、記事が長くなってしまうので折りたたんでいます。展開したい場合は「▶︎詳細」となっているところをクリックしてください。

サンプルテーブルの前提

サンプルテーブルの前提条件を記載しておきます。

使用したMySQLのバージョン

MySQL 8.0.21

※現時点での最新は 8.0.31 です。たまたま調べようと思ったタイミングでローカルに入っていたのがこのバージョン（以前個人ブログの方に書いた記事の検証で使ったバージョンだった）だったために過ぎません。

サンプルテーブル

CREATE TABLE `employee` (
  `emp_id` int unsigned NOT NULL AUTO_INCREMENT,
  `main_floor` int unsigned DEFAULT NULL,
  `gender` int unsigned DEFAULT NULL,
  PRIMARY KEY (`emp_id`),
) ENGINE=InnoDB;

CREATE TABLE `department` (
  `dept_id` int unsigned NOT NULL AUTO_INCREMENT,
  `main_floor` int unsigned DEFAULT NULL,
  `dept_name` varchar(100) DEFAULT NULL,
  PRIMARY KEY (`dept_id`)
) ENGINE=InnoDB;

データ内容（クリックで展開）

employee テーブル

emp_id	main_floor	gender
1	1	1
2	2	2
3	3	3
4	4	1
5	5	2
6	6	3
7	7	1
8	8	2
9	9	3
10	10	1
11	11	2
12	12	3
13	13	1
14	14	2
15	15	3
16	16	1
17	17	2
18	18	3
19	1	1
20	2	2
21	3	3
22	4	1
23	5	2
24	6	3
25	7	1
26	8	2
27	9	3
28	10	1
29	11	2
30	12	3
31	13	1
32	14	2
33	15	3
34	16	1
35	17	2
36	18	3

department テーブル

dept_id	main_floor	dept_name
1	1	Finance
2	2	Legal
3	3	Human Resorces
4	4	Corporate Planning
5	5	Sales 1
6	6	Sales 2
7	7	Accounting
8	8	Development 1
9	9	Development 2

Table Pullout

Table Pullout 戦略は、以下のような特徴を持ちます。

通常のJOINとして処理する
結合キーにUNIQUE制約（PRIMARY KEY含む）がある場合に利用可能
- メインクエリの結果1行に対してサブクエリから0or1行しか返らないことが保証されているので、通常のJOINにして結合順序を入れ替えてもメインクエリ側の行が最終結果の中で重複することがない

サンプルテーブルを用いた実行例から分かることは以下の通りです。

department 表の main_floor 列にUNIQUE制約を付与したところ選択された
オプティマイザトレースの pulled_out_semijoin_tables という項目に department 表が表れている
EXPLAIN ANALYZE をみると、 Remove duplicates from ... という重複除去を表す情報がない
- ここが後述のLooseScanとの違い

SELECT *
FROM employee e 
WHERE e.main_floor IN (
    SELECT main_floor 
    FROM department d
)

実行計画

mysql> explain  select * from employee e  where e.main_floor in ( select main_floor  from department d );
+----+-------------+-------+------------+-------+---------------------------+---------------------------+---------+----------------------+------+----------+--------------------------+
| id | select_type | table | partitions | type  | possible_keys             | key                       | key_len | ref                  | rows | filtered | Extra                    |
+----+-------------+-------+------------+-------+---------------------------+---------------------------+---------+----------------------+------+----------+--------------------------+
|  1 | SIMPLE      | d     | NULL       | index | department_main_floor_IDX | department_main_floor_IDX | 5       | NULL                 |    9 |   100.00 | Using where; Using index |
|  1 | SIMPLE      | e     | NULL       | ref   | employee_main_floor_IDX   | employee_main_floor_IDX   | 5       | sandbox.d.main_floor |    2 |   100.00 | NULL                     |
+----+-------------+-------+------------+-------+---------------------------+---------------------------+---------+----------------------+------+----------+--------------------------+
2 rows in set, 1 warning (0.00 sec)

Note (Code 1003): /* select#1 */ select `sandbox`.`e`.`emp_id` AS `emp_id`,`sandbox`.`e`.`main_floor` AS `main_floor`,`sandbox`.`e`.`gender` AS `gender` from `sandbox`.`department` `d` join `sandbox`.`employee` `e` where (`sandbox`.`e`.`main_floor` = `sandbox`.`d`.`main_floor`)

mysql> explain analyze  select * from employee e  where e.main_floor in ( select main_floor  from department d )\G
*************************** 1. row ***************************
EXPLAIN: -> Nested loop inner join  (cost=7.45 rows=18) (actual time=0.040..0.078 rows=18 loops=1)
    -> Filter: (d.main_floor is not null)  (cost=1.15 rows=9) (actual time=0.023..0.026 rows=9 loops=1)
        -> Index scan on d using department_main_floor_IDX  (cost=1.15 rows=9) (actual time=0.022..0.024 rows=9 loops=1)
    -> Index lookup on e using employee_main_floor_IDX (main_floor=d.main_floor)  (cost=0.52 rows=2) (actual time=0.005..0.005 rows=2 loops=9)

オプティマイザトレース（一部抜粋）

{
  "steps": [
    {
      "join_preparation": {
        "select#": 1,
        "steps": [
          // 中略
          {
            "transformations_to_nested_joins": {
              "transformations": [
                "semijoin"
              ],
              "expanded_query": "/* select#1 */ select `e`.`emp_id` AS `emp_id`,`e`.`main_floor` AS `main_floor`,`e`.`gender` AS `gender` from `employee` `e` semi join (`department` `d`) where ((`e`.`main_floor` = `d`.`main_floor`))"
            }
          }
        ]
      }
    },
    {
      "join_optimization": {
        "select#": 1,
        "steps": [
          // 中略
          {
            "pulled_out_semijoin_tables": [
              {
                "table": "`department` `d`",
                "functionally_dependent": true
              }
            ]
          },
          // 中略
        ]
      }
    },
    {
      "join_explain": {
        "select#": 1,
        "steps": [
        ]
      }
    }
  ]
}

LooseScan

LooseScan 戦略は以下のような特徴を持ちます。

サブクエリ側のテーブルの結合キーのカラムにインデックスがある場合に利用可能
- ※UNIQUE制約がついていれば Table Pullout も使えることになる
サブクエリ側のインデックスを重複を避けながらスキャンしていく
サブクエリ側が必ず駆動表になる
- 重複の除去を、サブクエリのインデックスを使って実現するため

サンプルテーブルを用いた実行例から分かることは以下の通りです。

department 表の main_floor 列にインデックスを追加したところ、LooseScan が候補に上がるようになった
- UNIQUE制約をつけると Table Pullout も利用可能になる
オプティマイザトレースのfinal_semijoin_strategy が LooseScan
実行計画の Extra 列に LooseScan と表示
EXPLAIN ANALYZE をみると、 Remove duplicates from ... という重複除去を表す情報がある
- ここがTable Pullout との違いになる

SELECT *
FROM employee e 
WHERE e.main_floor IN (
    SELECT main_floor 
    FROM department d
)

※ Table Pullout と同じクエリ

実行計画

mysql> explain  select * from employee e  where e.main_floor in ( select main_floor  from department d );
+----+-------------+-------+------------+-------+---------------------------+---------------------------+---------+----------------------+------+----------+-------------------------------------+
| id | select_type | table | partitions | type  | possible_keys             | key                       | key_len | ref                  | rows | filtered | Extra                               |
+----+-------------+-------+------------+-------+---------------------------+---------------------------+---------+----------------------+------+----------+-------------------------------------+
|  1 | SIMPLE      | d     | NULL       | index | department_main_floor_IDX | department_main_floor_IDX | 5       | NULL                 |    9 |   100.00 | Using where; Using index; LooseScan |
|  1 | SIMPLE      | e     | NULL       | ref   | employee_main_floor_IDX   | employee_main_floor_IDX   | 5       | sandbox.d.main_floor |    2 |   100.00 | NULL                                |
+----+-------------+-------+------------+-------+---------------------------+---------------------------+---------+----------------------+------+----------+-------------------------------------+
2 rows in set, 1 warning (0.00 sec)

Note (Code 1003): /* select#1 */ select `sandbox`.`e`.`emp_id` AS `emp_id`,`sandbox`.`e`.`main_floor` AS `main_floor`,`sandbox`.`e`.`gender` AS `gender` from `sandbox`.`employee` `e` semi join (`sandbox`.`department` `d`) where (`sandbox`.`e`.`main_floor` = `sandbox`.`d`.`main_floor`)

mysql> explain analyze  select * from employee e  where e.main_floor in ( select main_floor  from department d )\G
*************************** 1. row ***************************
EXPLAIN: -> Nested loop inner join  (actual time=0.109..0.141 rows=18 loops=1)
    -> Remove duplicates from input sorted on department_main_floor_IDX  (actual time=0.091..0.095 rows=9 loops=1)
        -> Filter: (d.main_floor is not null)  (cost=1.15 rows=9) (actual time=0.090..0.093 rows=9 loops=1)
            -> Index scan on d using department_main_floor_IDX  (cost=1.15 rows=9) (actual time=0.089..0.091 rows=9 loops=1)
    -> Index lookup on e using employee_main_floor_IDX (main_floor=d.main_floor)  (cost=4.70 rows=2) (actual time=0.004..0.005 rows=2 loops=9)

オプティマイザトレース（一部抜粋）

{
  "steps": [
    {
      "join_preparation": {
        "select#": 1,
        "steps": [
          // 中略
          {
            "transformations_to_nested_joins": {
              "transformations": [
                "semijoin"
              ],
              "expanded_query": "/* select#1 */ select `e`.`emp_id` AS `emp_id`,`e`.`main_floor` AS `main_floor`,`e`.`gender` AS `gender` from `employee` `e` semi join (`department` `d`) where ((`e`.`main_floor` = `d`.`main_floor`))"
            }
          }
        ]
      }
    },
    {
      "join_optimization": {
        "select#": 1,
        "steps": [
          // 中略
          {
            "considered_execution_plans": [
              {
                "plan_prefix": [
                ],
                "table": "`department` `d`",
                "best_access_path": {
                  "considered_access_paths": [
                    {
                      "access_type": "ref",
                      "index": "department_main_floor_IDX",
                      "usable": false,
                      "chosen": false
                    },
                    {
                      "rows_to_scan": 9,
                      "filtering_effect": [
                      ],
                      "final_filtering_effect": 1,
                      "access_type": "scan",
                      "resulting_rows": 9,
                      "cost": 1.15,
                      "chosen": true
                    }
                  ]
                },
                "condition_filtering_pct": 100,
                "rows_for_plan": 9,
                "cost_for_plan": 1.15,
                "semijoin_strategy_choice": [
                  {
                    "strategy": "MaterializeScan",
                    "choice": "deferred"
                  }
                ],
                "rest_of_plan": [
                  {
                    "plan_prefix": [
                      "`department` `d`"
                    ],
                    "table": "`employee` `e`",
                    "best_access_path": {
                      "considered_access_paths": [
                        {
                          "access_type": "ref",
                          "index": "employee_main_floor_IDX",
                          "rows": 2,
                          "cost": 6.3,
                          "chosen": true
                        },
                        {
                          "rows_to_scan": 36,
                          "filtering_effect": [
                          ],
                          "final_filtering_effect": 1,
                          "access_type": "scan",
                          "using_join_cache": true,
                          "buffers_needed": 1,
                          "resulting_rows": 36,
                          "cost": 32.65,
                          "chosen": false
                        }
                      ]
                    },
                    "condition_filtering_pct": 100,
                    "rows_for_plan": 18,
                    "cost_for_plan": 7.45,
                    "semijoin_strategy_choice": [
                      {
                        "strategy": "LooseScan",
                        "recalculate_access_paths_and_cost": {
                          "tables": [
                            {
                              "table": "`department` `d`",
                              "best_access_path": {
                                "considered_access_paths": [
                                  {
                                    "access_type": "ref",
                                    "index": "department_main_floor_IDX",
                                    "usable": false,
                                    "chosen": false
                                  },
                                  {
                                    "rows_to_scan": 9,
                                    "filtering_effect": [
                                    ],
                                    "final_filtering_effect": 1,
                                    "access_type": "scan",
                                    "resulting_rows": 9,
                                    "cost": 1.15,
                                    "chosen": true
                                  }
                                ]
                              },
                              "unknown_key_1": {
                                "searching_loose_scan_index": {
                                  "indexes": [
                                    {
                                      "index": "department_main_floor_IDX",
                                      "covering_scan": {
                                        "cost": 0.2522,
                                        "chosen": true
                                      }
                                    }
                                  ]
                                }
                              }
                            }
                          ]
                        },
                        "cost": 7.4522,
                        "rows": 2,
                        "chosen": true
                      },
                      {
                        "strategy": "MaterializeScan",
                        "recalculate_access_paths_and_cost": {
                          "tables": [
                            {
                              "table": "`employee` `e`",
                              "best_access_path": {
                                "considered_access_paths": [
                                  {
                                    "access_type": "ref",
                                    "index": "employee_main_floor_IDX",
                                    "rows": 2,
                                    "cost": 6.3,
                                    "chosen": true
                                  },
                                  {
                                    "rows_to_scan": 36,
                                    "filtering_effect": [
                                    ],
                                    "final_filtering_effect": 1,
                                    "access_type": "scan",
                                    "using_join_cache": true,
                                    "buffers_needed": 1,
                                    "resulting_rows": 36,
                                    "cost": 32.65,
                                    "chosen": false
                                  }
                                ]
                              }
                            }
                          ]
                        },
                        "cost": 10.25,
                        "rows": 2,
                        "duplicate_tables_left": false,
                        "chosen": false
                      },
                      {
                        "strategy": "DuplicatesWeedout",
                        "cost": 12.05,
                        "rows": 18,
                        "duplicate_tables_left": false,
                        "chosen": false
                      }
                    ],
                    "chosen": true
                  }
                ]
              },
              {
                "plan_prefix": [
                ],
                "table": "`employee` `e`",
                "best_access_path": {
                  "considered_access_paths": [
                    {
                      "access_type": "ref",
                      "index": "employee_main_floor_IDX",
                      "usable": false,
                      "chosen": false
                    },
                    {
                      "rows_to_scan": 36,
                      "filtering_effect": [
                      ],
                      "final_filtering_effect": 1,
                      "access_type": "scan",
                      "resulting_rows": 36,
                      "cost": 3.85,
                      "chosen": true
                    }
                  ]
                },
                "condition_filtering_pct": 100,
                "rows_for_plan": 36,
                "cost_for_plan": 3.85,
                "semijoin_strategy_choice": [
                ],
                "rest_of_plan": [
                  {
                    "plan_prefix": [
                      "`employee` `e`"
                    ],
                    "table": "`department` `d`",
                    "best_access_path": {
                      "considered_access_paths": [
                        {
                          "access_type": "ref",
                          "index": "department_main_floor_IDX",
                          "rows": 1,
                          "cost": 12.6,
                          "chosen": true
                        },
                        {
                          "access_type": "scan",
                          "chosen": false,
                          "cause": "covering_index_better_than_full_scan"
                        }
                      ]
                    },
                    "condition_filtering_pct": 100,
                    "rows_for_plan": 36,
                    "cost_for_plan": 16.45,
                    "semijoin_strategy_choice": [
                      {
                        "strategy": "FirstMatch",
                        "recalculate_access_paths_and_cost": {
                          "tables": [
                          ]
                        },
                        "cost": 16.45,
                        "rows": 36,
                        "chosen": true
                      },
                      {
                        "strategy": "MaterializeLookup",
                        "cost": 10.5,
                        "rows": 36,
                        "duplicate_tables_left": false,
                        "chosen": true
                      },
                      {
                        "strategy": "DuplicatesWeedout",
                        "cost": 24.65,
                        "rows": 36,
                        "duplicate_tables_left": false,
                        "chosen": false
                      }
                    ],
                    "pruned_by_cost": true
                  }
                ]
              },
              {
                "final_semijoin_strategy": "LooseScan",
                // 中略
                }
              }
            ]
          },
          //中略
          
        ]
      }
    },
    {
      "join_explain": {
        "select#": 1,
        "steps": [
        ]
      }
    }
  ]
}

Materialization

Materialization 戦略は以下の特徴を持ちます。

サブクエリの結果を実体化して、結合キーにインデックスを作成して重複を取り除いてからメインクエリとJOINする
- ※結合キーのカラムにインデックスがあれば LooseScan が使えるし、UNIQUE制約がついていれば Table Pullout が使える
  - ただし、コスト次第で他の戦略ではなくこちらが選択されることはある
    - たとえば、サブクエリでWhere句での絞り込みをしていて、絞り込み後に実体化する Materialization の方が、LooseScan（JOIN時に絞り込みを行う）よりもコストが低くなるケースなど
作成されたインデックスのキーが <auto_key> という形で実行計画に現れることがある
メインクエリ側とサブクエリ側のどちらが駆動表、内部表となるかはコストで決まる
- 実体化されたサブクエリ側が内部表になる場合が MaterializeLookup
- 実体化されたサブクエリ側が駆動表になる場合が MaterializeScan
  - 参考： MySQL: Query Optimizer
    - 4.MaterializeLookup (Materialize inner tables, then setup a scan over outer correlated tables, lookup in materialized table)
    - 5.MaterializeScan (Materialize inner tables, then setup a scan over materialized tables, perform lookup in outer tables)

サンプルテーブルを用いた実行例から分かることは以下の通りです。

これまでのクエリのサブクエリにWHERE句を追加したところ選択された
オプティマイザトレースのfinal_semijoin_strategyが MaterializeScan
- 今回はサブクエリ側の方が駆動表になったパターン
実行計画の select_type に MATERIALIZED と表示

SELECT *
FROM employee e 
WHERE e.main_floor IN (
    SELECT main_floor 
    FROM department d
    WHERE d.dept_name LIKE 'Sales%'
)

実行計画

mysql> Explain select * from employee e  where e.main_floor in ( select main_floor  from department d  where d.dept_name like 'Sales%' );
+----+--------------+-------------+------------+------+-------------------------+-------------------------+---------+------------------------+------+----------+-------------+
| id | select_type  | table       | partitions | type | possible_keys           | key                     | key_len | ref                    | rows | filtered | Extra       |
+----+--------------+-------------+------------+------+-------------------------+-------------------------+---------+------------------------+------+----------+-------------+
|  1 | SIMPLE       | <subquery2> | NULL       | ALL  | NULL                    | NULL                    | NULL    | NULL                   | NULL |   100.00 | Using where |
|  1 | SIMPLE       | e           | NULL       | ref  | employee_main_floor_IDX | employee_main_floor_IDX | 5       | <subquery2>.main_floor |    2 |   100.00 | NULL        |
|  2 | MATERIALIZED | d           | NULL       | ALL  | NULL                    | NULL                    | NULL    | NULL                   |    9 |    11.11 | Using where |
+----+--------------+-------------+------------+------+-------------------------+-------------------------+---------+------------------------+------+----------+-------------+
3 rows in set, 1 warning (0.00 sec)

Note (Code 1003): /* select#1 */ select `sandbox`.`e`.`emp_id` AS `emp_id`,`sandbox`.`e`.`main_floor` AS `main_floor`,`sandbox`.`e`.`gender` AS `gender` from `sandbox`.`employee` `e` semi join (`sandbox`.`department` `d`) where ((`sandbox`.`e`.`main_floor` = `<subquery2>`.`main_floor`) and (`sandbox`.`d`.`dept_name` like 'Sales%'))

オプティマイザトレース（一部抜粋）

{
  "steps": [
    {
      "join_preparation": {
        "select#": 1,
        "steps": [
          // 中略
          {
            "transformations_to_nested_joins": {
              "transformations": [
                "semijoin"
              ],
              "expanded_query": "/* select#1 */ select `e`.`emp_id` AS `emp_id`,`e`.`main_floor` AS `main_floor`,`e`.`gender` AS `gender` from `employee` `e` semi join (`department` `d`) where ((`d`.`dept_name` like 'Sales%') and (`e`.`main_floor` = `d`.`main_floor`))"
            }
          }
        ]
      }
    },
    {
      "join_optimization": {
        "select#": 1,
        "steps": [
          // 中略

          {
            "considered_execution_plans": [
              {
                "plan_prefix": [
                ],
                "table": "`department` `d`",
                "best_access_path": {
                  "considered_access_paths": [
                    {
                      "rows_to_scan": 9,
                      "filtering_effect": [
                      ],
                      "final_filtering_effect": 0.1111,
                      "access_type": "scan",
                      "resulting_rows": 1,
                      "cost": 1.15,
                      "chosen": true
                    }
                  ]
                },
                "condition_filtering_pct": 100,
                "rows_for_plan": 1,
                "cost_for_plan": 1.15,
                "semijoin_strategy_choice": [
                  {
                    "strategy": "MaterializeScan",
                    "choice": "deferred"
                  }
                ],
                "rest_of_plan": [
                  {
                    "plan_prefix": [
                      "`department` `d`"
                    ],
                    "table": "`employee` `e`",
                    "best_access_path": {
                      "considered_access_paths": [
                        {
                          "access_type": "ref",
                          "index": "employee_main_floor_IDX",
                          "rows": 2,
                          "cost": 0.7,
                          "chosen": true
                        },
                        {
                          "rows_to_scan": 36,
                          "filtering_effect": [
                          ],
                          "final_filtering_effect": 1,
                          "access_type": "scan",
                          "using_join_cache": true,
                          "buffers_needed": 1,
                          "resulting_rows": 36,
                          "cost": 3.8503,
                          "chosen": false
                        }
                      ]
                    },
                    "condition_filtering_pct": 100,
                    "rows_for_plan": 2,
                    "cost_for_plan": 1.85,
                    "semijoin_strategy_choice": [
                      {
                        "strategy": "MaterializeScan",
                        "recalculate_access_paths_and_cost": {
                          "tables": [
                            {
                              "table": "`employee` `e`",
                              "best_access_path": {
                                "considered_access_paths": [
                                  {
                                    "access_type": "ref",
                                    "index": "employee_main_floor_IDX",
                                    "rows": 2,
                                    "cost": 0.7,
                                    "chosen": true
                                  },
                                  {
                                    "rows_to_scan": 36,
                                    "filtering_effect": [
                                    ],
                                    "final_filtering_effect": 1,
                                    "access_type": "scan",
                                    "using_join_cache": true,
                                    "buffers_needed": 1,
                                    "resulting_rows": 36,
                                    "cost": 3.8503,
                                    "chosen": false
                                  }
                                ]
                              }
                            }
                          ]
                        },
                        "cost": 3.05,
                        "rows": 2,
                        "duplicate_tables_left": true,
                        "chosen": true
                      },
                      {
                        "strategy": "DuplicatesWeedout",
                        "cost": 3.25,
                        "rows": 2,
                        "duplicate_tables_left": false,
                        "chosen": false
                      }
                    ],
                    "chosen": true
                  }
                ]
              },
              {
                "plan_prefix": [
                ],
                "table": "`employee` `e`",
                "best_access_path": {
                  "considered_access_paths": [
                    {
                      "access_type": "ref",
                      "index": "employee_main_floor_IDX",
                      "usable": false,
                      "chosen": false
                    },
                    {
                      "rows_to_scan": 36,
                      "filtering_effect": [
                      ],
                      "final_filtering_effect": 1,
                      "access_type": "scan",
                      "resulting_rows": 36,
                      "cost": 3.85,
                      "chosen": true
                    }
                  ]
                },
                "condition_filtering_pct": 100,
                "rows_for_plan": 36,
                "cost_for_plan": 3.85,
                "semijoin_strategy_choice": [
                ],
                "pruned_by_cost": true
              },
              {
                "final_semijoin_strategy": "MaterializeScan",
                // 中略
              }
            ]
          },
          // 中略
        ]
      }
    },
    {
      "join_explain": {
        "select#": 1,
        "steps": [
        ]
      }
    }
  ]
}

Duplicate Weedout

Duplicate Weedout には以下の特徴があります。

JOINしてから、一時テーブルを作成して重複を取り除く
メインクエリの1行に対してサブクエリの結果が複数行マッチする場合には、JOINで処理することによって重複が発生するので、それを取り除くという戦略
JOIN時にメインクエリとサブクエリのどちらが駆動表・内部表となるかはコストによって決まる（どちらともありうる）
- 重複の除去を最後にするので、JOINはどちらを駆動表として行ってもよいため

サンプルテーブルを用いた実行例から分かることは以下の通りです。

Materialization 戦略が選択されたときのクエリをベースとして、メインクエリ側に AND で条件を足している
final_semijoin_strategy が DuplicateWeedout
実行計画の Extra 列に Start temporary と End temporary と表示
EXPLAIN ANALYZE では一番上に Remove duplicate e rows using temporary table (weedout) と表示
- 重複の除去を一時テーブルを用いて実施している

SELECT *
FROM employee e
WHERE e.main_floor IN (
    SELECT main_floor
    FROM department d
    WHERE d.dept_name LIKE 'Sales%'
)
AND gender = 2;

実行計画

mysql> explain select * from employee e  where e.main_floor in ( select main_floor from department d  where d.dept_name like 'Sales%' ) and gender = 2;
+----+-------------+-------+------------+------+-------------------------+-------------------------+---------+----------------------+------+----------+------------------------------+
| id | select_type | table | partitions | type | possible_keys           | key                     | key_len | ref                  | rows | filtered | Extra                        |
+----+-------------+-------+------------+------+-------------------------+-------------------------+---------+----------------------+------+----------+------------------------------+
|  1 | SIMPLE      | d     | NULL       | ALL  | NULL                    | NULL                    | NULL    | NULL                 |    9 |    11.11 | Using where; Start temporary |
|  1 | SIMPLE      | e     | NULL       | ref  | employee_main_floor_IDX | employee_main_floor_IDX | 5       | sandbox.d.main_floor |    2 |    10.00 | Using where; End temporary   |
+----+-------------+-------+------------+------+-------------------------+-------------------------+---------+----------------------+------+----------+------------------------------+
2 rows in set, 1 warning (0.00 sec)

Note (Code 1003): /* select#1 */ select `sandbox`.`e`.`emp_id` AS `emp_id`,`sandbox`.`e`.`main_floor` AS `main_floor`,`sandbox`.`e`.`gender` AS `gender` from `sandbox`.`employee` `e` semi join (`sandbox`.`department` `d`) where ((`sandbox`.`e`.`main_floor` = `sandbox`.`d`.`main_floor`) and (`sandbox`.`e`.`gender` = 2) and (`sandbox`.`d`.`dept_name` like 'Sales%'))

mysql> explain analyze select * from employee e  where e.main_floor in ( select main_floor from department d  where d.dept_name like 'Sales%' ) and gender = 2\G
*************************** 1. row ***************************
EXPLAIN: -> Remove duplicate e rows using temporary table (weedout)  (cost=1.85 rows=0) (actual time=0.051..0.062 rows=2 loops=1)
    -> Nested loop inner join  (cost=1.85 rows=0) (actual time=0.047..0.057 rows=2 loops=1)
        -> Filter: ((d.dept_name like 'Sales%') and (d.main_floor is not null))  (cost=1.15 rows=1) (actual time=0.027..0.030 rows=2 loops=1)
            -> Table scan on d  (cost=1.15 rows=9) (actual time=0.022..0.025 rows=9 loops=1)
        -> Filter: (e.gender = 2)  (cost=0.52 rows=0) (actual time=0.011..0.012 rows=1 loops=2)
            -> Index lookup on e using employee_main_floor_IDX (main_floor=d.main_floor)  (cost=0.52 rows=2) (actual time=0.011..0.012 rows=2 loops=2)

オプティマイザトレース（一部抜粋）

{
  "steps": [
    {
      "join_preparation": {
        "select#": 1,
        "steps": [
          // 中略
          {
            "transformations_to_nested_joins": {
              "transformations": [
                "semijoin"
              ],
              "expanded_query": "/* select#1 */ select `e`.`emp_id` AS `emp_id`,`e`.`main_floor` AS `main_floor`,`e`.`gender` AS `gender` from `employee` `e` semi join (`department` `d`) where ((`e`.`gender` = 2) and (`d`.`dept_name` like 'Sales%') and (`e`.`main_floor` = `d`.`main_floor`))"
            }
          }
        ]
      }
    },
    {
      "join_optimization": {
        "select#": 1,
        "steps": [
          // 中略
          {
            "considered_execution_plans": [
              {
                "plan_prefix": [
                ],
                "table": "`department` `d`",
                "best_access_path": {
                  "considered_access_paths": [
                    {
                      "rows_to_scan": 9,
                      "filtering_effect": [
                      ],
                      "final_filtering_effect": 0.1111,
                      "access_type": "scan",
                      "resulting_rows": 1,
                      "cost": 1.15,
                      "chosen": true
                    }
                  ]
                },
                "condition_filtering_pct": 100,
                "rows_for_plan": 1,
                "cost_for_plan": 1.15,
                "semijoin_strategy_choice": [
                  {
                    "strategy": "MaterializeScan",
                    "choice": "deferred"
                  }
                ],
                "rest_of_plan": [
                  {
                    "plan_prefix": [
                      "`department` `d`"
                    ],
                    "table": "`employee` `e`",
                    "best_access_path": {
                      "considered_access_paths": [
                        {
                          "access_type": "ref",
                          "index": "employee_main_floor_IDX",
                          "rows": 2,
                          "cost": 0.7,
                          "chosen": true
                        },
                        {
                          "rows_to_scan": 36,
                          "filtering_effect": [
                          ],
                          "final_filtering_effect": 0.1,
                          "access_type": "scan",
                          "using_join_cache": true,
                          "buffers_needed": 1,
                          "resulting_rows": 3.6,
                          "cost": 3.8541,
                          "chosen": false
                        }
                      ]
                    },
                    "condition_filtering_pct": 10,
                    "rows_for_plan": 0.2,
                    "cost_for_plan": 1.85,
                    "semijoin_strategy_choice": [
                      {
                        "strategy": "MaterializeScan",
                        "recalculate_access_paths_and_cost": {
                          "tables": [
                            {
                              "table": "`employee` `e`",
                              "best_access_path": {
                                "considered_access_paths": [
                                  {
                                    "access_type": "ref",
                                    "index": "employee_main_floor_IDX",
                                    "rows": 2,
                                    "cost": 0.7,
                                    "chosen": true
                                  },
                                  {
                                    "rows_to_scan": 36,
                                    "filtering_effect": [
                                    ],
                                    "final_filtering_effect": 0.1,
                                    "access_type": "scan",
                                    "using_join_cache": true,
                                    "buffers_needed": 1,
                                    "resulting_rows": 3.6,
                                    "cost": 3.8541,
                                    "chosen": false
                                  }
                                ]
                              }
                            }
                          ]
                        },
                        "cost": 3.05,
                        "rows": 0.2,
                        "duplicate_tables_left": true,
                        "chosen": true
                      },
                      {
                        "strategy": "DuplicatesWeedout",
                        "cost": 2.89,
                        "rows": 0.2,
                        "duplicate_tables_left": false,
                        "chosen": true
                      }
                    ],
                    "chosen": true
                  }
                ]
              },
              {
                "plan_prefix": [
                ],
                "table": "`employee` `e`",
                "best_access_path": {
                  "considered_access_paths": [
                    {
                      "access_type": "ref",
                      "index": "employee_main_floor_IDX",
                      "usable": false,
                      "chosen": false
                    },
                    {
                      "rows_to_scan": 36,
                      "filtering_effect": [
                      ],
                      "final_filtering_effect": 0.1,
                      "access_type": "scan",
                      "resulting_rows": 3.6,
                      "cost": 3.85,
                      "chosen": true
                    }
                  ]
                },
                "condition_filtering_pct": 100,
                "rows_for_plan": 3.6,
                "cost_for_plan": 3.85,
                "semijoin_strategy_choice": [
                ],
                "pruned_by_cost": true
              },
              {
                "final_semijoin_strategy": "DuplicateWeedout"
              }
            ]
          },
          // 中略
        ]
      }
    },
    {
      "join_execution": {
        "select#": 1,
        "steps": [
        ]
      }
    }
  ]
}

FirstMatch

FirstMatch 戦略には以下の特徴があります。

通常のNLJ（Nested Loop Join）に似ているが、内側の（サブクエリ側の）ループを回しているときに1行でも見つかったら即座に内側のループを打ち切って外側のループの次の周回に進むことができるため効率が良い
メインクエリ側が必ず駆動表になる
- 重複の除去を、内側のループを途中で打ち切ることによって実現しているため

サンプルテーブルを用いた実行例から分かることは以下の通りです。

これまでのクエリとは違い、department表へのクエリをメインクエリとしている
- これまでは、サブクエリのテーブルの方が小さかったが、今回のクエリはメインクエリのテーブルの方が小さい
オプティマイザトレースのfinal_semijoin_strategy が FirstMatch
実行計画のExtra列に FirstMatch(department) の表示がある

SELECT * 
FROM department
WHERE department.main_floor IN (
    SELECT main_floor
    FROM employee
);

実行計画

mysql> explain select * from department where department.main_floor in ( select main_floor from employee );
+----+-------------+------------+------------+------+-------------------------+-------------------------+---------+-------------------------------+------+----------+-------------------------------------+
| id | select_type | table      | partitions | type | possible_keys           | key                     | key_len | ref                           | rows | filtered | Extra                               |
+----+-------------+------------+------------+------+-------------------------+-------------------------+---------+-------------------------------+------+----------+-------------------------------------+
|  1 | SIMPLE      | department | NULL       | ALL  | NULL                    | NULL                    | NULL    | NULL                          |    9 |   100.00 | Using where                         |
|  1 | SIMPLE      | employee   | NULL       | ref  | employee_main_floor_IDX | employee_main_floor_IDX | 5       | sandbox.department.main_floor |    2 |   100.00 | Using index; FirstMatch(department) |
+----+-------------+------------+------------+------+-------------------------+-------------------------+---------+-------------------------------+------+----------+-------------------------------------+
2 rows in set, 1 warning (0.00 sec)

Note (Code 1003): /* select#1 */ select `sandbox`.`department`.`dept_id` AS `dept_id`,`sandbox`.`department`.`main_floor` AS `main_floor`,`sandbox`.`department`.`dept_name` AS `dept_name` from `sandbox`.`department` semi join (`sandbox`.`employee`) where (`sandbox`.`employee`.`main_floor` = `sandbox`.`department`.`main_floor`)

mysql> explain analyze select * from department where department.main_floor in ( select main_floor from employee )\G
*************************** 1. row ***************************
EXPLAIN: -> Nested loop semijoin  (cost=5.20 rows=18) (actual time=0.034..0.050 rows=9 loops=1)
    -> Filter: (department.main_floor is not null)  (cost=1.15 rows=9) (actual time=0.022..0.026 rows=9 loops=1)
        -> Table scan on department  (cost=1.15 rows=9) (actual time=0.022..0.025 rows=9 loops=1)
    -> Index lookup on employee using employee_main_floor_IDX (main_floor=department.main_floor)  (cost=0.54 rows=2) (actual time=0.002..0.002 rows=1 loops=9)

オプティマイザトレース（一部抜粋）

{
  "steps": [
    {
      "join_preparation": {
        "select#": 1,
        "steps": [
          // 中略
          {
            "transformations_to_nested_joins": {
              "transformations": [
                "semijoin"
              ],
              "expanded_query": "/* select#1 */ select `department`.`dept_id` AS `dept_id`,`department`.`main_floor` AS `main_floor`,`department`.`dept_name` AS `dept_name` from `department` semi join (`employee`) where ((`department`.`main_floor` = `employee`.`main_floor`))"
            }
          }
        ]
      }
    },
    {
      "join_optimization": {
        "select#": 1,
        "steps": [
          // 中略
          {
            "considered_execution_plans": [
              {
                "plan_prefix": [
                ],
                "table": "`department`",
                "best_access_path": {
                  "considered_access_paths": [
                    {
                      "rows_to_scan": 9,
                      "filtering_effect": [
                      ],
                      "final_filtering_effect": 1,
                      "access_type": "scan",
                      "resulting_rows": 9,
                      "cost": 1.15,
                      "chosen": true
                    }
                  ]
                },
                "condition_filtering_pct": 100,
                "rows_for_plan": 9,
                "cost_for_plan": 1.15,
                "semijoin_strategy_choice": [
                ],
                "rest_of_plan": [
                  {
                    "plan_prefix": [
                      "`department`"
                    ],
                    "table": "`employee`",
                    "best_access_path": {
                      "considered_access_paths": [
                        {
                          "access_type": "ref",
                          "index": "employee_main_floor_IDX",
                          "rows": 2,
                          "cost": 4.0525,
                          "chosen": true
                        },
                        {
                          "access_type": "scan",
                          "chosen": false,
                          "cause": "covering_index_better_than_full_scan"
                        }
                      ]
                    },
                    "condition_filtering_pct": 100,
                    "rows_for_plan": 18,
                    "cost_for_plan": 5.2025,
                    "semijoin_strategy_choice": [
                      {
                        "strategy": "FirstMatch",
                        "recalculate_access_paths_and_cost": {
                          "tables": [
                          ]
                        },
                        "cost": 5.2025,
                        "rows": 9,
                        "chosen": true
                      },
                      {
                        "strategy": "MaterializeLookup",
                        "cost": 10.5,
                        "rows": 9,
                        "duplicate_tables_left": false,
                        "chosen": false
                      },
                      {
                        "strategy": "DuplicatesWeedout",
                        "cost": 8.9025,
                        "rows": 9,
                        "duplicate_tables_left": false,
                        "chosen": false
                      }
                    ],
                    "chosen": true
                  }
                ]
              },
              {
                "plan_prefix": [
                ],
                "table": "`employee`",
                "best_access_path": {
                  "considered_access_paths": [
                    {
                      "access_type": "ref",
                      "index": "employee_main_floor_IDX",
                      "usable": false,
                      "chosen": false
                    },
                    {
                      "rows_to_scan": 36,
                      "filtering_effect": [
                      ],
                      "final_filtering_effect": 1,
                      "access_type": "scan",
                      "resulting_rows": 36,
                      "cost": 3.85,
                      "chosen": true
                    }
                  ]
                },
                "condition_filtering_pct": 100,
                "rows_for_plan": 36,
                "cost_for_plan": 3.85,
                "semijoin_strategy_choice": [
                  {
                    "strategy": "MaterializeScan",
                    "choice": "deferred"
                  }
                ],
                "pruned_by_heuristic": true
              },
              {
                "final_semijoin_strategy": "FirstMatch",
                "recalculate_access_paths_and_cost": {
                  "tables": [
                  ]
                }
              }
            ]
          },
          // 中略
        ]
      }
    },
    {
      "join_execution": {
        "select#": 1,
        "steps": [
        ]
      }
    }
  ]
}

終わりに

以上、セミジョイン最適化について、個々の戦略の内容を含めて書いてきました。サンプルテーブルを用いて取得した実行計画やオプティマイザトレースはあくまで一例に過ぎず、少し条件を変えるとまた違った結果が得られるかもしれません。冒頭にも書いたように、記載している内容に誤りなどを見つけた場合は筆者のTwitterまでご一報いただけると幸いです。

emp_id	main_floor	gender
1	1	1
2	2	2
3	3	3
4	4	1
5	5	2
6	6	3
7	7	1
8	8	2
9	9	3
10	10	1
11	11	2
12	12	3
13	13	1
14	14	2
15	15	3
16	16	1
17	17	2
18	18	3
19	1	1
20	2	2
21	3	3
22	4	1
23	5	2
24	6	3
25	7	1
26	8	2
27	9	3
28	10	1
29	11	2
30	12	3
31	13	1
32	14	2
33	15	3
34	16	1
35	17	2
36	18	3

emp_id	main_floor	gender
1	1	1
2	2	2
3	3	3
4	4	1
5	5	2
6	6	3
7	7	1
8	8	2
9	9	3
10	10	1
11	11	2
12	12	3
13	13	1
14	14	2
15	15	3
16	16	1
17	17	2
18	18	3
19	1	1
20	2	2
21	3	3
22	4	1
23	5	2
24	6	3
25	7	1
26	8	2
27	9	3
28	10	1
29	11	2
30	12	3
31	13	1
32	14	2
33	15	3
34	16	1
35	17	2
36	18	3

前置き

この記事で書くこと

この記事で書かないこと

MySQLの公式ドキュメントを読むだけでわかること

コードリーディングに基づいた知見

MySQLのバージョン間の比較

他のRDBMSとの比較

セミジョイン最適化概説

なぜセミジョイン最適化がパフォーマンス的に嬉しいのか？

セミジョインにはどのような「戦略」があるのか？

各戦略の特徴

サンプルテーブルの前提

使用したMySQLのバージョン

サンプルテーブル

Table Pullout

実行計画

オプティマイザトレース（一部抜粋）

LooseScan

実行計画

オプティマイザトレース（一部抜粋）

Materialization

実行計画

オプティマイザトレース（一部抜粋）

Duplicate Weedout

実行計画

オプティマイザトレース（一部抜粋）

FirstMatch

実行計画

オプティマイザトレース（一部抜粋）

終わりに

参考資料

公式

5.6

8.0

その他

emp_id	main_floor	gender
1	1	1
2	2	2
3	3	3
4	4	1
5	5	2
6	6	3
7	7	1
8	8	2
9	9	3
10	10	1
11	11	2
12	12	3
13	13	1
14	14	2
15	15	3
16	16	1
17	17	2
18	18	3
19	1	1
20	2	2
21	3	3
22	4	1
23	5	2
24	6	3
25	7	1
26	8	2
27	9	3
28	10	1
29	11	2
30	12	3
31	13	1
32	14	2
33	15	3
34	16	1
35	17	2
36	18	3