<?xml version='1.0' encoding='utf-8'?>
<?xml-stylesheet type="text/xsl" href="/sheet.xsl"?><rss version="2.0"><channel><title>Gaël Varoquaux</title><item><title>Stepping up as probabl’s CSO to supercharge scikit-learn and its ecosystem</title><link>https://gael-varoquaux.info/programming/stepping-up-as-probabls-cso-to-supercharge-scikit-learn-and-its-ecosystem.html</link><description>&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;div class="figure align-right"&gt;
&lt;img alt="" src="../programming/attachments/probabl_team_2025.png" style="width: 400px;" /&gt;
&lt;p class="caption"&gt;Probabl’s get-together, in fall 2025&lt;/p&gt;
&lt;/div&gt;
&lt;p class="last"&gt;I’m thrilled to announce that I’m stepping up as &lt;a class="reference external" href="https://probabl.ai/?utm_source=employee_blog&amp;amp;utm_medium=social_employee&amp;amp;utm_campaign=202601_probabl_awareness_post"&gt;Probabl&lt;/a&gt;’s CSO (Chief Science Officer) to supercharge
scikit-learn and its ecosystem, pursuing my dreams of tools that help go
from data to impact.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="scikit-learn-a-central-tool"&gt;
&lt;h2&gt;Scikit-learn, a central tool&lt;/h2&gt;
&lt;p&gt;Scikit-learn is central …&lt;/p&gt;&lt;/div&gt;</description><ns0:encoded xmlns:ns0="http://purl.org/rss/1.0/modules/content/">&lt;div class="content" morss_own_score="4.455445544554456" morss_score="10.009702521463367"&gt;


&lt;h1&gt;Stepping up as probabl’s CSO to supercharge scikit-learn and its ecosystem&lt;/h1&gt;
&lt;p&gt;
                            under                                 &lt;a href="https://gael-varoquaux.info/tag/open-source.html"&gt;open source&lt;/a&gt;
&lt;a href="https://gael-varoquaux.info/tag/growth.html"&gt;growth&lt;/a&gt;
&lt;a href="https://gael-varoquaux.info/tag/communities.html"&gt;communities&lt;/a&gt;
&lt;a href="https://gael-varoquaux.info/tag/scikit-learn.html"&gt;scikit-learn&lt;/a&gt;
&lt;a href="https://gael-varoquaux.info/tag/inria.html"&gt;inria&lt;/a&gt;
&lt;a href="https://gael-varoquaux.info/tag/probabl.html"&gt;probabl&lt;/a&gt;
&lt;span&gt;
			Read time: 3 min.
		    &lt;/span&gt;


 &lt;/p&gt;


&lt;div class="admonition note" morss_own_score="2.6923076923076925" morss_score="7.7548076923076925"&gt;
&lt;p&gt;Note&lt;/p&gt;

&lt;img src="https://gael-varoquaux.info/programming/attachments/probabl_team_2025.png"&gt;
&lt;p&gt;Probabl’s get-together, in fall 2025&lt;/p&gt;

&lt;p&gt;I’m thrilled to announce that I’m stepping up as &lt;a href="https://probabl.ai/?utm_source=employee_blog&amp;amp;utm_medium=social_employee&amp;amp;utm_campaign=202601_probabl_awareness_post"&gt;Probabl&lt;/a&gt;’s CSO (Chief Science Officer) to supercharge
scikit-learn and its ecosystem, pursuing my dreams of tools that help go
from data to impact.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="scikit-learn-a-central-tool" morss_own_score="1.3796791443850267" morss_score="12.379679144385026"&gt;
&lt;h2&gt;Scikit-learn, a central tool&lt;/h2&gt;
&lt;p morss_own_score="7.0" morss_score="9.0"&gt;Scikit-learn is central to data-scientists’ work: it is &lt;strong&gt;the most used
machine-learning package&lt;/strong&gt;. It has grown over more than a decade,
supported by volunteers’ time, donations, and grant funding, with a
central role of Inria.&lt;/p&gt;

&lt;img src="https://gael-varoquaux.info/programming/attachments/scikit-learn_clickpy_2025.png"&gt;
&lt;p&gt;Scikit-learn download numbers; &lt;a href="https://clickpy.clickhouse.com/dashboard/scikit-learn"&gt;reproduce and explore on clickpy&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And the usage numbers keep going up…&lt;/p&gt;
&lt;p&gt;Scikit-learn keeps growing because it enables crucial applications:
machine-learning that can be easily adapted to a given application. This
type of AI does not make the headlines, but it is central to the value
brought by data science. It is used across the board to extract insights
from data and automate business-specific processes, thus ensuring
function and efficiency of a wide variety of activities.&lt;/p&gt;

&lt;br&gt;

&lt;p&gt;And scikit-learn is quietly but steadily advancing. The recent releases
bring progress in all directions: computational foundations (&lt;a href="https://scikit-learn.org/stable/auto_examples/release_highlights/plot_release_highlights_1_8_0.html#array-api-support-enables-gpu-computations"&gt;the array
API enabling GPU support&lt;/a&gt;),
user interface (&lt;a href="https://scikit-learn.org/stable/auto_examples/release_highlights/plot_release_highlights_1_8_0.html#html-representation-of-estimators"&gt;rich HTML displays&lt;/a&gt;),
new models (eg &lt;a href="https://scikit-learn.org/stable/modules/generated/sklearn.cluster.HDBSCAN.html"&gt;HDBSCAN&lt;/a&gt;,
&lt;a href="https://scikit-learn.org/stable/auto_examples/release_highlights/plot_release_highlights_1_8_0.html#temperature-scaling-in-calibratedclassifiercv"&gt;temperature-scaling recalibration&lt;/a&gt; …), and always algorithmic
improvements (release 1.8 brought &lt;a href="https://scikit-learn.org/stable/auto_examples/release_highlights/plot_release_highlights_1_8_0.html#efficiency-improvements-in-linear-models"&gt;marked speed ups to linear models&lt;/a&gt; or
&lt;a href="https://scikit-learn.org/stable/auto_examples/release_highlights/plot_release_highlights_1_8_0.html#decisiontreeregressor-with-criterion-absolute-error"&gt;trees with MAE&lt;/a&gt;).&lt;/p&gt;
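&lt;p&gt;To make the temperature-scaling idea concrete, here is a minimal sketch in plain Python, not scikit-learn’s implementation: predicted scores are divided by a temperature before the softmax. The scores and the temperature value are hypothetical; in an actual recalibration the temperature would be fitted on held-out data rather than fixed by hand.&lt;/p&gt;

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax of the logits divided by a temperature.

    A temperature above 1 softens the probabilities; the ranking of the
    classes is unchanged.
    """
    scaled = [z / temperature for z in logits]
    shift = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [4.0, 1.0, 0.0]       # hypothetical classifier scores
raw = softmax(logits)          # probabilities before recalibration
cooled = softmax(logits, 2.0)  # hypothetical temperature of 2

print(raw)
print(cooled)
```

&lt;p&gt;With a temperature above 1, the top class keeps its rank but is predicted with less extreme confidence, which is the effect sought when a model is over-confident.&lt;/p&gt;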
&lt;/div&gt;
&lt;div class="section" id="a-new-opportunity-to-boost-scikit-learn-and-its-ecosystem" morss_own_score="3.082872928176796" morss_score="9.587220754263752"&gt;
&lt;h2&gt;A new opportunity to boost scikit-learn and its ecosystem&lt;/h2&gt;
&lt;p morss_own_score="6.0" morss_score="8.5"&gt;Probabl recently raised a &lt;a href="https://blog.probabl.ai/probabl-raises-a-13m-in-seed-to-accelerate-enterprise-grade-ai?utm_source=employee_blog&amp;amp;utm_medium=social_employee&amp;amp;utm_campaign=202601_blog_awareness_post"&gt;beautiful seed funding&lt;/a&gt;
from investors who really understand the value and perspective of
scikit-learn. We have a unique opportunity to accelerate scikit-learn’s
development. Our analysis is that &lt;strong&gt;enterprises need dedicated tooling and
partners to build best on scikit-learn&lt;/strong&gt;, and we’re hard at work to provide
this.&lt;/p&gt;
&lt;p&gt;Two thirds of Probabl’s founders are scikit-learn contributors, and we have
been investing in all aspects of scikit-learn: features, releases,
communication, documentation, and training. In addition, part of
scikit-learn’s success has always been to nurture an ecosystem, for
instance via its simple API that has become a standard. Thus Probabl is
not only consolidating scikit-learn, but also this ecosystem: the &lt;a href="https://skops.readthedocs.io/en/stable/"&gt;skops
project, to put scikit-learn based models in production&lt;/a&gt;, the &lt;a href="https://skrub-data.org"&gt;skrub project, that
facilitates data preparation&lt;/a&gt;, the &lt;a href="https://skore.probabl.ai/?utm_source=employee_blog&amp;amp;utm_medium=social_employee&amp;amp;utm_campaign=202601_skore_awareness_post"&gt;young skore
project to track data science&lt;/a&gt;, &lt;a href="https://fairlearn.org/"&gt;fairlearn
to help avoid machine learning that discriminates&lt;/a&gt;, and more upstream projects, such as &lt;a href="https://joblib.readthedocs.io/en/stable/"&gt;joblib
for parallel computing&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="section" id="my-obsession-as-probabl-cso-serving-the-data-scientists" morss_own_score="3.0" morss_score="16.0"&gt;
&lt;h2&gt;My obsession as Probabl CSO: serving the data scientists&lt;/h2&gt;
&lt;p morss_own_score="7.0" morss_score="11.0"&gt;As CSO (Chief Science Officer) at Probabl, my role is to nourish our
development strategy with understanding of machine learning, data
science, and open source. Making sure that &lt;strong&gt;scikit-learn and its
ecosystem are enterprise ready&lt;/strong&gt; will bring resources for scikit-learn’s
sustainability, enabling its ecosystem to grow into a standard-setting
platform for the industry, that continues &lt;strong&gt;to serve data scientists&lt;/strong&gt;.
This mission will require consolidating the existing tools and patterns,
and inventing new ones.&lt;/p&gt;

&lt;br&gt;

&lt;p&gt;Probabl is in a unique position for this endeavor: Our core is an amazing
team of engineers with deep knowledge of data science. Working directly
with businesses gives us an acute understanding of where the ecosystem
can be improved. On this topic, I also profoundly enjoy working with
people who have a different DNA than the historical DNA of scikit-learn,
with product research, marketing, and business mindsets. I believe that
the union of our different cultures will make the scikit-learn ecosystem
better.&lt;/p&gt;
&lt;p&gt;Beyond the Probabl team, we have an amazing community, with a broader
group of scikit-learn contributors who do an amazing job bringing
together what makes scikit-learn so versatile, with a deep ecosystem of
Python data tools enriched by so many different actors. I’m deeply
grateful to the many scikit-learn and PyData contributors. At Probabl, we
are very attuned to enabling the open-source contributor community. Such
a community is what enables a single tool, scikit-learn, to serve a long
tail of diverse usages.&lt;/p&gt;
&lt;/div&gt;



&lt;/div&gt;
</ns0:encoded><pubDate>Wed, 14 Jan 2026 00:00:00 </pubDate></item><item><title>2025 highlights: AI research and code</title><link>https://gael-varoquaux.info/science/2025-highlights-ai-research-and-code.html</link><description>&lt;div class="figure align-right"&gt;
&lt;img alt="" class="small" src="attachments/2025_highlights/eiffel_tower_ai.jpg" /&gt;
&lt;p class="caption"&gt;AI is everywhere. Can you see it here?&lt;/p&gt;
&lt;/div&gt;
&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;Some highlights about my work in 2025: progress on
tabular learning stands out, along with a publication unpacking the trade-offs and
consequences of scale in AI, and of course progress on the open-source
data-science and machine learning stack.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;As 2026 starts, I’m looking …&lt;/p&gt;</description><ns0:encoded xmlns:ns0="http://purl.org/rss/1.0/modules/content/">&lt;div class="content" morss_own_score="4.7174887892376685" morss_score="13.7715759602903"&gt;


&lt;h1&gt;2025 highlights: AI research and code&lt;/h1&gt;
&lt;p&gt;
                            under                                 &lt;a href="https://gael-varoquaux.info/tag/science.html"&gt;science&lt;/a&gt;
&lt;a href="https://gael-varoquaux.info/tag/research.html"&gt;research&lt;/a&gt;
&lt;a href="https://gael-varoquaux.info/tag/machine-learning.html"&gt;machine learning&lt;/a&gt;
&lt;a href="https://gael-varoquaux.info/tag/python.html"&gt;python&lt;/a&gt;
&lt;a href="https://gael-varoquaux.info/tag/yearly-report.html"&gt;yearly report&lt;/a&gt;
&lt;span&gt;
			Read time: 6 min.
		    &lt;/span&gt;


 &lt;/p&gt;



&lt;img src="https://gael-varoquaux.info/science/attachments/2025_highlights/eiffel_tower_ai.jpg"&gt;
&lt;p&gt;AI is everywhere. Can you see it here?&lt;/p&gt;


&lt;p&gt;Note&lt;/p&gt;
&lt;p&gt;Some highlights about my work in 2025: progress on
tabular learning stands out, along with a publication unpacking the trade-offs and
consequences of scale in AI, and of course progress on the open-source
data-science and machine learning stack.&lt;/p&gt;

&lt;p&gt;As 2026 starts, I’m looking back on 2025. It was all about AI, with
research in the &lt;a href="https://team.inria.fr/soda/"&gt;soda team&lt;/a&gt; on tabular
machine learning stimulating better software.&lt;/p&gt;


&lt;div class="section" id="beyond-maths-unpacking-the-scale-narrative-in-ai" morss_own_score="2.794736842105263" morss_score="14.077880781499204"&gt;
&lt;h2&gt;&lt;a href="https://gael-varoquaux.info/science/2025-highlights-ai-research-and-code.html#toc-entry-1"&gt;Beyond maths: Unpacking the scale narrative in AI&lt;/a&gt;&lt;/h2&gt;
&lt;p morss_own_score="6.878787878787879" morss_score="12.878787878787879"&gt;Plotting the increase of the scale of notable AI systems in the last
years reveals a staggering explosion. AI’s size has been growing
super-exponentially along a variety of dimensions: training compute, training cost
(figure below), inference cost, amount of data used. Studying the wording
used in pivotal publications as well as company communications shows that
it anchors AI success in this growth, thus &lt;strong&gt;setting implicit social
norms around scale&lt;/strong&gt;. But systematic analysis of benchmark results shows
that &lt;strong&gt;scale does not always bring benefit&lt;/strong&gt;. The narrative of scale is
simplified and leaves aside many important ingredients of success of AI
systems. In addition, the race for scale comes with planetary and
societal consequences, which we study and &lt;a href="https://dl.acm.org/doi/10.1145/3715275.3732006"&gt;document&lt;/a&gt;. Ever-increasing
inference costs threaten economic and electricity sustainability. An
unstoppable appetite for training data leads to fitting models on
enormous datasets that elude quality control, engulfing undesirable
facets of the internet (including child pornography) or eroding privacy. The
race for scale has financial consequences, benefiting above all actors of
compute, but also structuring an ecosystem where cash-rich and GPU-rich
actors have leverage on priorities, industrial or academic. These actors
sometimes have circular investment strategies: funding third parties
that will spend all this funding on compute, which can fuel &lt;strong&gt;an
investment bubble in AI&lt;/strong&gt;.&lt;/p&gt;

&lt;img src="https://gael-varoquaux.info/science/attachments/2025_highlights/cost_ai.png"&gt;
&lt;p&gt;Evolution of the training cost (in dollars) of notable AI systems
across the years&lt;/p&gt;

&lt;p&gt;We conclude our study, &lt;a href="https://dl.acm.org/doi/10.1145/3715275.3732006"&gt;published at FAccT&lt;/a&gt;, by underlining that &lt;strong&gt;academic
research has a central role to play in these dynamics and must shape a
healthy and grounded narrative&lt;/strong&gt;. Our recommendations are to:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;pursue basic AI research of interest independent of scale, &lt;em&gt;eg&lt;/em&gt;
uncertainty quantification, causality…&lt;/li&gt;
&lt;li&gt;hold responsible norms, in particular avoiding asking for compute
increase when editing or reviewing,&lt;/li&gt;
&lt;li&gt;always publish measures of compute to document the tradeoffs.&lt;/li&gt;
&lt;/ol&gt;

&lt;img src="https://gael-varoquaux.info/science/attachments/2025_highlights/pareto_schema.png"&gt;
&lt;p&gt;We need to document and explore the tradeoffs&lt;/p&gt;

&lt;p&gt;In addition, I personally want to push those tradeoffs in the direction
of resource-efficient progress, and not only resource-intensive progress
(as illustrated in the figure alongside),
which is the easy route to task performance, but not the one that brings
the most value.&lt;/p&gt;
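&lt;p&gt;One way to document such tradeoffs is to report, for each model, both its compute cost and its task score, and then look at the Pareto front: the models that no other model beats on both axes. Below is a minimal sketch in plain Python; the (cost, score) pairs are made-up numbers for illustration, not results from the study.&lt;/p&gt;

```python
def pareto_front(points):
    """Return the (cost, score) points that are not dominated.

    A point is kept only if no cheaper-or-equal point reaches at least
    the same score.
    """
    front = []
    best_score = float("-inf")
    # Sort by increasing cost; on equal cost, the highest score comes first.
    for cost, score in sorted(points, key=lambda p: (p[0], -p[1])):
        if score > best_score:  # strictly better than anything cheaper
            front.append((cost, score))
            best_score = score
    return front


# Hypothetical (compute cost, task score) pairs for five models.
runs = [(1, 0.70), (10, 0.80), (100, 0.82), (12, 0.75), (1000, 0.82)]
front = pareto_front(runs)
print(front)
```

&lt;p&gt;In this toy example, the most expensive model is not on the front: it costs ten times more than the next one without improving the score, which is exactly the kind of information that publishing compute measures makes visible.&lt;/p&gt;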
&lt;/div&gt;

&lt;h2&gt;&lt;a href="https://gael-varoquaux.info/science/2025-highlights-ai-research-and-code.html#toc-entry-2"&gt;Tabular-learning research&lt;/a&gt;&lt;/h2&gt;
&lt;div class="section" id="tabicl-open-source-table-foundation-model" morss_own_score="2.6407185628742513" morss_score="11.969986855557178"&gt;
&lt;h3&gt;&lt;a href="https://gael-varoquaux.info/science/2025-highlights-ai-research-and-code.html#toc-entry-3"&gt;TabICL:  open-source table foundation model&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Recent tabular-learning models have been bringing better performance. A
prime example is the TabPFN series of models, which rely on
pretrained transformers to bring excellent performance. However, the
quadratic complexity of the transformers is a bottleneck. I do fear that
the agenda of fancy tabular learning is leading us into a race for scale
again.&lt;/p&gt;
&lt;p&gt;With the &lt;a href="https://icml.cc/virtual/2025/poster/46681"&gt;TabICL model&lt;/a&gt; we
strive to decrease this computational cost. We showed that a multi-stage
architecture can build a pretrained in-context predictor where the
separation of states decreases the quadratic cost. The model can be
pretrained on larger datasets, and is thus the best performer on
larger tables. The model is faster than alternatives, in
particular when using a CPU rather than a GPU. In addition, we released
in &lt;strong&gt;open source all the code&lt;/strong&gt;, including the pretraining.&lt;/p&gt;
&lt;p&gt;TabICL gives a table foundation model that is easy to use on modest or
big hardware and that can be easily customized.&lt;/p&gt;

&lt;br&gt;

&lt;/div&gt;
&lt;div class="section" id="retrieve-merge-predict-tradeoffs-of-predictions-from-data-lakes" morss_own_score="2.718475073313783" morss_score="16.903159757998466"&gt;
&lt;h3&gt;&lt;a href="https://gael-varoquaux.info/science/2025-highlights-ai-research-and-code.html#toc-entry-4"&gt;Retrieve merge predict: tradeoffs of predictions from data lakes&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;A full data-science pipeline must often assemble data across multiple
source tables:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p morss_own_score="7.0" morss_score="10.5"&gt;&lt;em&gt;Alice is working on a base table that contains information about
movies. She also has access to a data lake, or a collection of other
tables on all sorts of subjects. She wants to predict the ranking of
a movie based on as much information as possible. She would like to
extract information from the data lake to improve the performance of her
model.&lt;/em&gt;&lt;/p&gt;
&lt;p morss_own_score="7.0" morss_score="10.5"&gt;&lt;em&gt;The challenge is that the information of interest is mixed with a
huge amount of unrelated data. Thus, Alice’s problem is: “how to find
tables that are relevant to my problem? how to combine them with the
base table?”&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;When the user is faced with a complex data lake, with many
tables and few explicit links between them, it is difficult to find the
best assembly for a given machine-learning task. This problem requires
not only finding which tables must be joined to the main table of interest
(a table-retrieval problem), but also deciding how to aggregate multiple records
when tables are linked through a many-to-one relation. While table
retrieval is a classic problem of the data-management literature, it had
been understudied in the case of supervised machine learning. We
assembled a systematic, open benchmark with data lakes &lt;em&gt;and&lt;/em&gt;
supervised-learning tasks (&lt;a href="https://openreview.net/pdf?id=4uPJN6yfY1"&gt;publication&lt;/a&gt;, &lt;a href="https://soda-inria.github.io/retrieve-merge-predict/"&gt;benchmark material&lt;/a&gt;).&lt;/p&gt;
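&lt;p&gt;The “aggregate, then join” step can be sketched in a few lines of plain Python. The tables, column names, and the choice of the mean as aggregation are all hypothetical, picked to echo Alice’s movie example; this is not the benchmark’s code.&lt;/p&gt;

```python
from collections import defaultdict
from statistics import mean

# Hypothetical base table: one row per movie.
movies = [
    {"movie_id": 1, "title": "A"},
    {"movie_id": 2, "title": "B"},
    {"movie_id": 3, "title": "C"},
]
# Hypothetical table retrieved from the data lake: many reviews per movie.
reviews = [
    {"movie_id": 1, "stars": 4},
    {"movie_id": 1, "stars": 2},
    {"movie_id": 2, "stars": 5},
]

# Step 1: aggregate the "many" side so that each key has a single record.
stars_by_movie = defaultdict(list)
for review in reviews:
    stars_by_movie[review["movie_id"]].append(review["stars"])
mean_stars = {key: mean(values) for key, values in stars_by_movie.items()}

# Step 2: left-join onto the base table; movies without reviews get None.
enriched = [
    {**movie, "mean_stars": mean_stars.get(movie["movie_id"])}
    for movie in movies
]

for row in enriched:
    print(row)
```

&lt;p&gt;The base table keeps exactly one row per movie, which is what a downstream supervised learner expects; the missing value for the movie without reviews is left for the learner to handle.&lt;/p&gt;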
&lt;p&gt;We found that supervised learning does change the picture compared to
classic table-retrieval settings in that for a fixed compute budget, it
is worth avoiding fancy retrieval methods, which can be very
computationally costly, and rather using better supervised learning
methods, which can be comparatively less expensive while being
able to extract the relevant information from a noisy retrieval.&lt;/p&gt;

&lt;img src="https://gael-varoquaux.info/science/attachments/2025_highlights/yadl_benchmark.png"&gt;
&lt;p&gt;A schema of the pipeline&lt;/p&gt;

&lt;p&gt;The pipeline that we studied here is one that is broader than the
typical machine-learning modeling step. In my experience, data-science
applications are often much more complex than mere tabular learning, and
for this reason, we develop the skrub software, described below.&lt;/p&gt;

&lt;br&gt;

&lt;/div&gt;


&lt;h2&gt;&lt;a href="https://gael-varoquaux.info/science/2025-highlights-ai-research-and-code.html#toc-entry-5"&gt;Growing the machine learning and data science stack&lt;/a&gt;&lt;/h2&gt;

&lt;h3&gt;&lt;a href="https://gael-varoquaux.info/science/2025-highlights-ai-research-and-code.html#toc-entry-6"&gt;Skrub: machine learning with tables&lt;/a&gt;&lt;/h3&gt;
&lt;a href="https://skrub-data.org"&gt;&lt;img src="https://gael-varoquaux.info/science/attachments/skrub_logo.png"&gt;&lt;/a&gt;
&lt;p&gt;&lt;a href="https://skrub-data.org"&gt;Skrub&lt;/a&gt; is a recent library to blend machine
learning with data-frame computing. In 2025, we ironed out existing
features to make them more performant and really easy to use. For
instance the &lt;a href="https://skrub-data.org/stable/reference/generated/skrub.TableVectorizer.html"&gt;TableVectorizer&lt;/a&gt;
is incredibly useful to build tabular machine-learning pipelines. But we
have also added exciting new features:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The &lt;a href="https://skrub-data.org/stable/reference/generated/skrub.ApplyToCols.html"&gt;ApplyToCols&lt;/a&gt; is an object that can use skrub’s powerful &lt;a href="https://skrub-data.org/stable/modules/multi_column_operations/selectors.html"&gt;selectors&lt;/a&gt; to apply transforms to some columns but not others. I find myself using it all the time.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://skrub-data.org/stable/data_ops.html"&gt;DataOps&lt;/a&gt; are an
incredibly powerful way of blending dataframe transformations with the
scikit-learn fit/transform/predict API, to build complete machine-learning
pipelines across multiple tables. The benefit is that, unlike
standard data-wrangling code, they can be applied to new data and
cross-validated, and any component of the pipeline can be tuned to
maximize a prediction score. We have even added Optuna support for this
tuning.&lt;/li&gt;
&lt;/ul&gt;
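&lt;p&gt;To illustrate why it matters that data preparation “can be applied to new data”, here is a toy sketch of the underlying fit/transform convention in plain Python. It is not skrub’s DataOps API: the &lt;code&gt;MeanImputer&lt;/code&gt; class and the data are hypothetical, and only show that the state learned on training data is replayed unchanged on unseen data.&lt;/p&gt;

```python
from statistics import mean


class MeanImputer:
    """Toy transformer: learn a column mean on training data, reuse it later."""

    def fit(self, values):
        observed = [v for v in values if v is not None]
        self.mean_ = mean(observed)  # state learned from the training data only
        return self

    def transform(self, values):
        # Replace missing entries with the mean learned during fit.
        return [self.mean_ if v is None else v for v in values]


train = [1.0, None, 3.0]   # hypothetical training column
new_data = [None, 10.0]    # hypothetical unseen column

imputer = MeanImputer().fit(train)
train_clean = imputer.transform(train)
new_clean = imputer.transform(new_data)  # filled with the training mean

print(train_clean)
print(new_clean)
```

&lt;p&gt;Ad-hoc wrangling code would typically recompute the mean on whatever data it is given; separating fit from transform is what makes the step safe to cross-validate.&lt;/p&gt;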


&lt;h3&gt;&lt;a href="https://gael-varoquaux.info/science/2025-highlights-ai-research-and-code.html#toc-entry-7"&gt;Fundamental progress in scikit-learn&lt;/a&gt;&lt;/h3&gt;
&lt;a href="https://scikit-learn.org"&gt;&lt;img src="https://gael-varoquaux.info/science/attachments/scikit-learn-logo.png"&gt;&lt;/a&gt;
&lt;p&gt;What strikes me in the 2025 releases of &lt;a href="https://scikit-learn.org"&gt;scikit-learn&lt;/a&gt; is that we have been
making progress on fundamental improvements to the core features:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Faster linear models and tree-based models thanks to better algorithms
(which, in certain cases, give massive speedups).&lt;/li&gt;
&lt;li&gt;Ramping up GPU support: we are progressively adding to scikit-learn a
compute backend that enables GPU computing (an intro &lt;a href="https://scikit-learn.org/stable/auto_examples/release_highlights/plot_release_highlights_1_8_0.html#array-api-support-enables-gpu-computations"&gt;here&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;Free-threaded: we now support the “free-threaded” version of Python,
which removes a central lock and opens the door to
heavily-multithreaded parallel computing. More of the ecosystem needs
to support free-threaded Python for it to be widely used, but I am
hoping that in the medium term we’ll see great improvements to parallel
computing.&lt;/li&gt;
&lt;/ul&gt;
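&lt;p&gt;For readers curious whether their interpreter is a free-threaded build, the following standard-library sketch checks it and runs a small CPU-bound task on several threads. The workload is a made-up example; on a regular build the threads take turns because of the global lock, while on a free-threaded build they can run in parallel.&lt;/p&gt;

```python
import sys
import sysconfig
from concurrent.futures import ThreadPoolExecutor

# Build-time flag: 1 on free-threaded builds, 0 or None otherwise.
free_threaded_build = bool(sysconfig.get_config_var("Py_GIL_DISABLED"))

# Runtime check, available from Python 3.13; older versions always have a GIL.
gil_check = getattr(sys, "_is_gil_enabled", None)
gil_enabled = gil_check() if gil_check is not None else True


def sum_of_squares(n):
    """CPU-bound pure-Python work (a made-up workload for illustration)."""
    return sum(i * i for i in range(n))


# Run the same task on four threads and collect the results in order.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(sum_of_squares, [10_000] * 4))

print(f"free-threaded build: {free_threaded_build}, GIL enabled: {gil_enabled}")
```

&lt;p&gt;The results are identical on both kinds of build; only the wall-clock time of CPU-bound threaded code is expected to differ.&lt;/p&gt;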

&lt;br&gt;

&lt;p&gt;Exciting times :)&lt;/p&gt;





&lt;/div&gt;
</ns0:encoded><pubDate>Fri, 02 Jan 2026 00:00:00 </pubDate></item><item><title>Maïc, you lived 100 years, what changed?</title><link>https://gael-varoquaux.info/personnal/maic-you-lived-100-years-what-changed.html</link><description>&lt;p&gt;At Maïc’s 100th birthday, I asked her “you lived 100 years, what was the most important change for you?”. She mentioned “Internet”. I asked, why was the Internet important to her eyes? Because this is how she kept close contact with her loved ones, sharing travels or discussing everyday …&lt;/p&gt;</description><ns0:encoded xmlns:ns0="http://purl.org/rss/1.0/modules/content/">&lt;div class="content" morss_own_score="5.702031602708804" morss_score="34.39049976105648"&gt;


&lt;h1&gt;Maïc, you lived 100 years, what changed?&lt;/h1&gt;
&lt;p&gt;
                            under                                 &lt;a href="https://gael-varoquaux.info/tag/family.html"&gt;family&lt;/a&gt;
&lt;a href="https://gael-varoquaux.info/tag/people.html"&gt;people&lt;/a&gt;
&lt;span&gt;
			Read time: 3 min.
		    &lt;/span&gt;


 &lt;/p&gt;


&lt;p&gt;At Maïc’s 100th birthday, I asked her “you lived 100 years, what was the most important change for you?”. She mentioned “Internet”. I asked, why was the Internet important in her eyes? Because this is how she kept close contact with her loved ones, sharing travels or discussing everyday life on her phone, her tablet…&lt;/p&gt;

&lt;br&gt;

&lt;p&gt;Born in 1925, she was of a generation sometimes called the silent one. And indeed, she was often low-key. Her father was an administrator in the countryside, and she arrived in Paris in her youth. She studied maths, joining the prestigious “Ecole Normale Supérieure”, which provided her with an income and led her to become a maths teacher. After meeting and marrying &lt;a href="https://gael-varoquaux.info/personnal/jean-dechoux-june-13rd-1923-feb-9th-2020.html"&gt;Jean Dechoux&lt;/a&gt;, she used her income to fund his medical studies. The story goes that, living in a tiny room, she had to cook on the balcony.&lt;/p&gt;
&lt;p&gt;Maïc was a teacher, one of those unsung heroes that have educated the masses. Nowadays, this is not a job title that is much acclaimed, unlike say “start-up founder”. But the only reason we have good computer scientists that create start-ups, the only reason we have researchers to build computer science, is because they had great teachers. Maïc was also a mother, a foster mother, a grandmother, a great grandmother. She was kind, humble, tireless, always positive. Her life philosophy was focused on doing the best with what she got.&lt;/p&gt;
&lt;p&gt;Maïc never seemed left behind by the transformations of our world. Turning 100 years old, she was as sharp as ever, reading book after book and using her phone, her tablet, her computer. Whenever I hear how technology changes the world, I cannot help thinking of her, a 100-year-old geek. The world went through many transformations during her lifetime. But what she saw in these transformations, in Internet technology, was a way to stay in contact with others, a way to bring more humanity into our lives.&lt;/p&gt;
&lt;img src="https://gael-varoquaux.info/personnal/attachments/nicole_dechoux.jpg"&gt;
&lt;br&gt;&lt;p&gt;&lt;em&gt;Remembering Nicole Dechoux, May 03rd 1925 - October 22nd 2025&lt;/em&gt;&lt;/p&gt;
&lt;br&gt;
&lt;br&gt;
&lt;div class="poem docutils container" morss_own_score="3.0" morss_score="27.0"&gt;
&lt;p&gt;Il restera de toi ce que tu as donné&lt;/p&gt;
&lt;p&gt;Au lieu de le garder dans des coffres rouillés…&lt;/p&gt;
&lt;p&gt;Ce que tu as donné en d’autres fleurira…&lt;/p&gt;

&lt;br&gt;

&lt;p&gt;Il restera de toi ce que tu as offert&lt;/p&gt;
&lt;p&gt;Entre tes bras ouverts un matin au soleil…&lt;/p&gt;
&lt;p&gt;Ce que tu as offert en d’autres revivra…&lt;/p&gt;

&lt;br&gt;

&lt;p&gt;Il restera de toi un sourire épanoui&lt;/p&gt;
&lt;p&gt;Aux bords de tes lèvres comme au bord de ton cœur…&lt;/p&gt;
&lt;p&gt;Ce que tu as ouvert en d’autres grandira…&lt;/p&gt;

&lt;br&gt;

&lt;p&gt;Il restera de toi ce que tu as semé&lt;/p&gt;
&lt;p&gt;Que tu as partagé aux mendiants du bonheur…&lt;/p&gt;
&lt;p&gt;Ce que tu as semé en d’autres germera…&lt;/p&gt;

&lt;br&gt;

&lt;/div&gt;
&lt;p&gt;&lt;em&gt;Adapted from Simone Weil and Michel Scouarnec&lt;/em&gt;&lt;/p&gt;



&lt;/div&gt;
</ns0:encoded><pubDate>Wed, 29 Oct 2025 00:00:00 </pubDate></item><item><title>A national recognition; but science and open source are bitter victories</title><link>https://gael-varoquaux.info/personnal/a-national-recognition-but-science-and-open-source-are-bitter-victories.html</link><description>&lt;img alt="" class="align-right" src="../personnal/attachments/gael_speech.jpg" style="width: 400px;" /&gt;
&lt;p&gt;I have recently been awarded &lt;a class="reference external" href="https://en.wikipedia.org/wiki/Ordre_national_du_M%C3%A9rite"&gt;France’s national order of merit&lt;/a&gt;, for my career, in science, in open source, and around AI.&lt;/p&gt;
&lt;p&gt;The speech that I gave carries messages important to me (French below; it
flows better).&lt;/p&gt;
&lt;div class="contents topic" id="contents"&gt;
&lt;p class="topic-title"&gt;Contents&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;a class="reference internal" href="#speech-translated-to-english" id="toc-entry-1"&gt;Speech translated to English&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="reference internal" href="#le-texte-d-origine-en-francais" id="toc-entry-2"&gt;Le texte d’origine, en Français&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;style&gt;
.content p …&lt;/style&gt;</description><ns0:encoded xmlns:ns0="http://purl.org/rss/1.0/modules/content/">&lt;div class="content" morss_own_score="5.613240418118467" morss_score="16.171956236461615"&gt;


&lt;h1&gt;A national recognition; but science and open source are bitter victories&lt;/h1&gt;
&lt;p&gt;
                            under                                 &lt;a href="https://gael-varoquaux.info/tag/award.html"&gt;award&lt;/a&gt;
&lt;a href="https://gael-varoquaux.info/tag/open-source.html"&gt;open source&lt;/a&gt;
&lt;a href="https://gael-varoquaux.info/tag/science.html"&gt;science&lt;/a&gt;
&lt;span&gt;
			Read time: 6 min.
		    &lt;/span&gt;


 &lt;/p&gt;


&lt;img src="https://gael-varoquaux.info/personnal/attachments/gael_speech.jpg"&gt;
&lt;p&gt;I have recently been awarded &lt;a href="https://en.wikipedia.org/wiki/Ordre_national_du_M%C3%A9rite"&gt;France’s national order of merit&lt;/a&gt;, for my career, in science, in open source, and around AI.&lt;/p&gt;
&lt;p&gt;The speech that I gave carries messages important to me (French below; it
flows better).&lt;/p&gt;


&lt;div class="section" id="speech-translated-to-english" morss_own_score="2.9387755102040813" morss_score="44.93877551020408"&gt;
&lt;h2&gt;&lt;a href="https://gael-varoquaux.info/personnal/a-national-recognition-but-science-and-open-source-are-bitter-victories.html#toc-entry-1"&gt;Speech translated to English&lt;/a&gt;&lt;/h2&gt;

&lt;br&gt;

&lt;p&gt;Receiving such a medal is a powerful symbol. But what battles does it honor?&lt;/p&gt;
&lt;p&gt;My first battle, my first dream, was that of science, with the hope of understanding and improving the world. I probably turned to computers because they were simpler, less frightening, than society.&lt;/p&gt;
&lt;p&gt;This led me to my second battle: the dream of democratizing this science and these digital tools, thanks to open source, also in the hope of making a better world.&lt;/p&gt;
&lt;p&gt;The freedom I enjoyed, a privilege of researchers, allowed me to devote my time to these dreams. And many people helped on this journey, such as my colleagues at Inria and elsewhere –science is a team sport–, or free software developers from all over the world.&lt;/p&gt;

&lt;br&gt;

&lt;p&gt;And two decades later, we have won. Open source is everywhere. Statistical algorithms raise billions of dollars. But what good will this free software, these algorithms, have been if an Elon Musk can buy their vector of action and transform it into a fascist machine? This victory is bitter.&lt;/p&gt;

&lt;br&gt;

&lt;p&gt;Science, open source, come to play within a societal context, mediated by norms and means of action. These means of action are rooted in economic rationality, and I find myself, to my great surprise, interested in commercial and financial logics.&lt;/p&gt;
&lt;p&gt;Money is power. It is the ability to build, to buy Twitter or to finance Wikipedia. For science or open source to be successful, we need economic ambitions.&lt;/p&gt;
&lt;p&gt;But I do not want to reduce the world to economic motivations. Science and free software result from the work of individuals who believe in what they are doing. With scikit-learn, as with many other open source projects, humble developers with few resources have created incredible wealth.&lt;/p&gt;
&lt;p&gt;And it is these battles that today’s medal rewards. I have always been wary of individual distinctions. Success is rarely the work of a single person. We need more collective effort and fewer heroes, less ego.&lt;/p&gt;
&lt;p&gt;And yet, I hope that this medal, this symbol, can be useful. Indeed, symbols create the collective narrative, and control the choices we make, individually or as a society. For both science and free software, the risk is to be invisible, unheard, and powerless.&lt;/p&gt;

&lt;br&gt;

&lt;p&gt;Neither lines of code nor equations will be enough to make a better world. The privilege of a researcher is the independence of thought necessary for the consolidation of knowledge. The unique strength of open source software is to offer independence to the user. Beyond independence, this knowledge and this software are only useful if society embraces them. And for that, we must win the battle of the narrative.&lt;/p&gt;

&lt;br&gt;

&lt;p&gt;Today, I have only one dream: that our children live in the best possible world. Between the global rise of fascism and climate warming, this dream faces many challenges. But we can fight for it. For this, as always, we need to gather people and unite around the right causes. And thus, I thank you all for the support and help you have given me across the years, for today’s recognition.&lt;/p&gt;
&lt;p&gt;✶ ✶ ✶&lt;/p&gt;

&lt;br&gt;

&lt;/div&gt;
&lt;div class="section" id="le-texte-d-origine-en-francais" morss_own_score="5.869565217391305" morss_score="48.369565217391305"&gt;
&lt;h2&gt;&lt;a href="https://gael-varoquaux.info/personnal/a-national-recognition-but-science-and-open-source-are-bitter-victories.html#toc-entry-2"&gt;Le texte d’origine, en Français&lt;/a&gt;&lt;/h2&gt;

&lt;br&gt;

&lt;p&gt;Recevoir un tel insigne est un symbole puissant. Mais quels combats décore-t-il?&lt;/p&gt;
&lt;p&gt;Mon premier combat, mon premier rêve a été celui de la science, avec l’espoir de comprendre et d’améliorer le monde. Je me suis probablement tourné vers les ordinateurs car ils étaient plus simples, moins effrayants, que la société.&lt;/p&gt;
&lt;p&gt;Un deuxième combat est né en moi: le rêve de démocratiser cette science et ces outils numériques, grâce au logiciel libre, toujours dans l’espoir de faire un monde meilleur.&lt;/p&gt;
&lt;p&gt;La liberté dont j’ai joui, privilège inouï des chercheurs, m’a permis de me consacrer à ces rêves. Et beaucoup m’ont aidé: mes collègues à Inria et ailleurs, car la science est un sport d’équipe; les développeurs logiciels libres partout dans le monde; mes parents, qui m’ont donné l’amour de la science même lorsque j’étais en échec scolaire.&lt;/p&gt;

&lt;br&gt;

&lt;p&gt;Et deux décennies plus tard, nous avons gagné. Les logiciels libres sont partout. Les algorithmes statistiques font des levées de fonds de plusieurs milliards. Mais à quoi auront servi ces logiciels libres, ces algorithmes, si un Elon Musk peut racheter leur vecteur d’action et le transformer en machine à fascisme. Cette victoire est amère.&lt;/p&gt;

&lt;br&gt;

&lt;p&gt;La science, le logiciel libre, se réalisent dans un contexte sociétal, médié par des normes et des moyens d’actions. Ces moyens d’actions sont ancrés dans le rationnel économique, et je me trouve, à ma grande surprise, à m’intéresser à des logiques commerciales et financières.&lt;/p&gt;
&lt;p&gt;L’argent, c’est le pouvoir. C’est la capacité de réaliser, de racheter twitter ou de financer wikipedia. Pour le succès de la science ou du logiciel libre, nous avons besoin d’une ambition économique.&lt;/p&gt;
&lt;p&gt;Mais je ne voudrais réduire le monde aux motivations économiques. La science et le logiciel libre résultent du travail d’individus qui croient à ce qu’ils font. Avec scikit-learn, comme avec beaucoup d’autres logiciels libres, des développeurs humbles et avec peu de moyens ont créé une richesse incroyable.&lt;/p&gt;
&lt;p&gt;Et c’est ces combats que récompense aujourd’hui l’insigne que je reçois. Je me suis toujours méfié des distinctions individuelles. Un succès est rarement l’œuvre d’un seul. Nous avons besoin de plus de collectif et de moins de héros, de moins d’égo.&lt;/p&gt;
&lt;p&gt;Et pourtant, j’espère que cette médaille, ce symbole, peut être utile. En effet, les symboles créent le récit collectif, et contrôlent les choix que nous faisons, individuellement ou en tant que société. Science comme logiciel libre, le risque est d’être invisibles, inaudibles, et impuissants.&lt;/p&gt;

&lt;br&gt;

&lt;p&gt;La ligne de code, ou l’équation, ne suffiront à faire un meilleur monde. Le privilège du chercheur, c’est l’indépendance de pensée nécessaire à la consolidation de la connaissance. L’atout du logiciel libre, c’est d’offrir une indépendance à l’utilisateur. Au-delà de l’indépendance, cette connaissance et ces logiciels ne sont utiles que si la société s’en empare. Et pour cela, il nous faut gagner la bataille du récit.&lt;/p&gt;

&lt;br&gt;

&lt;p&gt;Aujourd’hui, je n’ai plus qu’un rêve: que nos enfants vivent dans le meilleur monde possible. Entre montée mondiale du fascisme et réchauffement climatique, j’ai la détermination que ce rêve ne soit pas une chimère. Pour ce rêve, il nous faut encore réunir, rassembler, et je vous remercie tous des soutiens et des aides que vous m’avez apportés, de cet honneur que vous me faites aujourd’hui.&lt;/p&gt;

&lt;img src="https://gael-varoquaux.info/personnal/attachments/gael_knight_monty_python.jpg"&gt;
&lt;p&gt;Technically, I might be a knight now&lt;/p&gt;

&lt;p&gt;✶ ✶ ✶&lt;/p&gt;
&lt;/div&gt;



&lt;/div&gt;
</ns0:encoded><pubDate>Fri, 10 Oct 2025 00:00:00 </pubDate></item><item><title>TabICL: Pretraining the best tabular learner</title><link>https://gael-varoquaux.info/science/tabicl-pretraining-the-best-tabular-learner.html</link><description>&lt;div class="admonition note"&gt;
&lt;p class="first admonition-title"&gt;Note&lt;/p&gt;
&lt;p class="last"&gt;TabICL is a state-of-the-art tabular learner &lt;a class="reference external" href="https://arxiv.org/abs/2502.05564"&gt;[Qu et al 2025]&lt;/a&gt;. The key is its very rich
prior, which is baked into a pre-trained architecture (a table foundation
model) and leveraged by in-context learning. Thanks to clever
choices, it is fast and scalable, and efficient even without a GPU.&lt;/p&gt;
&lt;/div&gt;
&lt;div class="contents topic" id="contents"&gt;
&lt;p class="topic-title"&gt;Contents&lt;/p&gt;
&lt;ul class="simple"&gt;
&lt;li&gt;&lt;a class="reference internal" href="#recent-progress-in-tabular-learning-in-context-learning" id="toc-entry-1"&gt;Recent progress …&lt;/a&gt;&lt;/li&gt;&lt;/ul&gt;&lt;/div&gt;</description><ns0:encoded xmlns:ns0="http://purl.org/rss/1.0/modules/content/">&lt;div class="content" morss_own_score="4.47" morss_score="14.146258986983659"&gt;


&lt;h1&gt;TabICL: Pretraining the best tabular learner&lt;/h1&gt;
&lt;p&gt;
                            under                                 &lt;a href="https://gael-varoquaux.info/tag/machine-learning.html"&gt;machine learning&lt;/a&gt;
&lt;a href="https://gael-varoquaux.info/tag/tabular-learning.html"&gt;tabular learning&lt;/a&gt;
&lt;a href="https://gael-varoquaux.info/tag/foundation-models.html"&gt;foundation models&lt;/a&gt;
&lt;span&gt;
			Read time: 5 min.
		    &lt;/span&gt;


 &lt;/p&gt;



&lt;p&gt;Note&lt;/p&gt;
&lt;p&gt;TabICL is a state-of-the-art tabular learner &lt;a href="https://arxiv.org/abs/2502.05564"&gt;[Qu et al 2025]&lt;/a&gt;. The key is its very rich
prior, which is baked into a pre-trained architecture (a table foundation
model) and leveraged by in-context learning. Thanks to clever
choices, it is fast and scalable, and efficient even without a GPU.&lt;/p&gt;



&lt;p&gt;This note is about the research behind TabICL &lt;a href="https://arxiv.org/abs/2502.05564"&gt;[Qu et al 2025]&lt;/a&gt;, work by Jingang Qu, David
Holzmüller, myself, and Marine Le Morvan, published at ICML 2025, and
available as &lt;a href="https://tabicl.readthedocs.io/en/latest/"&gt;open-source software&lt;/a&gt;.&lt;/p&gt;

&lt;br&gt;

&lt;div class="section" id="recent-progress-in-tabular-learning-in-context-learning" morss_own_score="4.822784810126583" morss_score="14.253598763614955"&gt;
&lt;h2&gt;&lt;a href="https://gael-varoquaux.info/science/tabicl-pretraining-the-best-tabular-learner.html#toc-entry-1"&gt;Recent progress in tabular learning: In-Context Learning&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Describing the statistical structure of tables in general is very subtle.
They do have some unique statistical features. For instance, each column
is typically meaningful by itself, more meaningful than linear
combinations of columns (the data are &lt;em&gt;not rotationally invariant&lt;/em&gt;, cf
&lt;a href="https://proceedings.neurips.cc/paper_files/paper/2022/hash/0378c7692da36807bdec87ab043cdadc-Abstract-Datasets_and_Benchmarks.html"&gt;[Grinsztajn et al, 2022]&lt;/a&gt;).
For a long time, tree-based models, in particular gradient-boosted trees, were
the models that best captured this statistical structure.&lt;/p&gt;
&lt;p&gt;The question is thus: &lt;strong&gt;how can we build complex and rich inductive biases
into statistical models&lt;/strong&gt;?&lt;/p&gt;
&lt;p&gt;A pioneering contribution to this question was made with the TabPFN
approach &lt;a href="https://www.nature.com/articles/s41586-024-08328-6"&gt;[Hollmann et al, 2025]&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;&lt;a href="https://gael-varoquaux.info/science/tabicl-pretraining-the-best-tabular-learner.html#toc-entry-2"&gt;Tabular learning as a completion problem&lt;/a&gt;&lt;/h3&gt;

&lt;img src="https://gael-varoquaux.info/science/attachments/tabicl/table_in_context_learning.png"&gt;
&lt;p&gt;Prediction by table completion using across-row transformers&lt;/p&gt;

&lt;p&gt;The key idea behind this line of work is that tabular learning can be
seen as completing a table where one column has a missing entry.
Transformer-based large-language models are very good at completing
sequences, in particular in the few-shot regime. Hence the idea to use a
transformer architecture for this table-completion task.&lt;/p&gt;
&lt;p&gt;More specifically, this is a &lt;em&gt;meta-learning&lt;/em&gt; setting (learning to learn),
using transformers.&lt;/p&gt;
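&lt;p&gt;To make the table-completion view concrete, here is a minimal sketch of in-context prediction in plain Python. It is a toy stand-in, not the TabPFN or TabICL model: the function name &lt;code&gt;in_context_predict&lt;/code&gt; and the &lt;code&gt;temperature&lt;/code&gt; parameter are assumptions made for illustration, and the attention scores come from a fixed distance rather than from learned weights. What it does show is the mechanism: the labeled rows are only a context, and the missing entry is filled in by attending across rows.&lt;/p&gt;

```python
import math


def in_context_predict(context_rows, context_labels, query_row, temperature=1.0):
    """Predict class probabilities for query_row by attending over labeled rows.

    Toy stand-in for across-row attention: the score is the negative squared
    distance between rows, and nothing is learned (illustrative assumption).
    """
    # Attention scores: closer context rows get a higher score.
    scores = [
        -sum((q - c) ** 2 for q, c in zip(query_row, row)) / temperature
        for row in context_rows
    ]
    # Softmax over the context rows (shifted by the max for numerical stability).
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    # The prediction is the attention-weighted vote of the context labels.
    probabilities = {}
    for weight, label in zip(weights, context_labels):
        probabilities[label] = probabilities.get(label, 0.0) + weight / total
    return probabilities


# The "training set" is only a context: no gradient step happens at prediction time.
context_rows = [[0.0, 0.1], [0.2, 0.0], [1.0, 0.9], [0.9, 1.1]]
context_labels = ["a", "a", "b", "b"]
probabilities = in_context_predict(context_rows, context_labels, [0.95, 1.0])
predicted = max(probabilities, key=probabilities.get)
print(predicted, probabilities)
```

&lt;p&gt;In a pretrained model, the fixed distance above is replaced by attention weights learned across many prediction problems, which is where the meta-learning comes in.&lt;/p&gt;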

&lt;div class="section" id="sophisticated-prior-via-data-generation" morss_own_score="2.65" morss_score="15.65"&gt;
&lt;h3&gt;&lt;a href="https://gael-varoquaux.info/science/tabicl-pretraining-the-best-tabular-learner.html#toc-entry-3"&gt;Sophisticated prior via data generation&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Teaching transformers to predict well requires showing them many, many
prediction problems.&lt;/p&gt;
&lt;p&gt;The benefit of this approach is that these prediction problems can be
chosen to reflect the downstream task well. In particular, it now becomes
easy to bake in any form of inductive bias by simulating data.&lt;/p&gt;
&lt;p&gt;TabPFN simulates data by cascading a series of simple transformations,
each combining very few columns. The actual data-generating processes are
more subtle, but the idea is that they are plausible for data tables.&lt;/p&gt;
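&lt;p&gt;As a rough illustration of such a simulator, the sketch below draws one synthetic classification table by cascading a few simple transformations that each combine two columns. It is not the TabPFN or TabICL prior: the function &lt;code&gt;sample_synthetic_table&lt;/code&gt;, its parameters, the choice of nonlinearities, and the median-threshold labeling are all assumptions made for this example.&lt;/p&gt;

```python
import math
import random


def sample_synthetic_table(n_rows, n_columns, n_steps=3, seed=0):
    """Draw one synthetic classification table (rows X, binary labels y)."""
    rng = random.Random(seed)
    # Start from independent Gaussian noise columns.
    columns = [[rng.gauss(0.0, 1.0) for _ in range(n_rows)] for _ in range(n_columns)]
    nonlinearities = [math.tanh, math.sin, abs]
    for _ in range(n_steps):
        # Each step rewrites one column as a simple function of two columns only.
        target, left, right = (rng.randrange(n_columns) for _ in range(3))
        w_left, w_right = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        f = rng.choice(nonlinearities)
        columns[target] = [
            f(w_left * a + w_right * b)
            for a, b in zip(columns[left], columns[right])
        ]
    # The label comes from one more such step, thresholded at its median.
    left, right = rng.randrange(n_columns), rng.randrange(n_columns)
    w_left, w_right = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    score = [w_left * a + w_right * b for a, b in zip(columns[left], columns[right])]
    median = sorted(score)[n_rows // 2]
    y = [int(value > median) for value in score]
    # Transpose to a row-major layout: X[i] is the i-th row of the table.
    X = [list(row) for row in zip(*columns)]
    return X, y


X, y = sample_synthetic_table(n_rows=8, n_columns=4)
print(len(X), len(X[0]), y)
```

&lt;p&gt;Changing the seed gives a new prediction problem, so a pretraining loop can draw as many tables as it needs. Because each transformation acts on a couple of columns, individual columns keep their own meaning, which is the kind of inductive bias discussed above.&lt;/p&gt;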
&lt;p&gt;Experience (ours and others') shows that pretraining on a high-quality
data-generation process is crucial to produce a good tabular learner,
as with foundation models in other settings.&lt;/p&gt;

&lt;br&gt;

&lt;/div&gt;
&lt;/div&gt;

&lt;h2&gt;&lt;a href="https://gael-varoquaux.info/science/tabicl-pretraining-the-best-tabular-learner.html#toc-entry-4"&gt;TabICL: improved architecture&lt;/a&gt;&lt;/h2&gt;

&lt;h3&gt;&lt;a href="https://gael-varoquaux.info/science/tabicl-pretraining-the-best-tabular-learner.html#toc-entry-5"&gt;The challenge: accounting for the structure of tables&lt;/a&gt;&lt;/h3&gt;

&lt;img src="https://gael-varoquaux.info/science/attachments/tabicl/tabpfn_architecture.png"&gt;
&lt;p&gt;Tables are 2D objects, and the TabPFNv2 architecture alternates
attention across rows and across columns&lt;/p&gt;

&lt;p&gt;In practice, a table is not a 1D structure like a sentence. It is closer
to a 2D structure, with rows and columns. A good architecture will
account for this structure, and the TabPFNv2 architecture uses
transformers with alternating across-row and across-column attention.&lt;/p&gt;
&lt;p morss_own_score="7.0" morss_score="13.0"&gt;One problem is the computational complexity: attention is quadratic in
the number of entries, and the bi-directional transform of TabPFNv2 leads
to a cost in &lt;em&gt;O(n p² + p n²)&lt;/em&gt; for a table with &lt;em&gt;n&lt;/em&gt; rows and &lt;em&gt;p&lt;/em&gt; columns.&lt;/p&gt;


&lt;h3&gt;&lt;a href="https://gael-varoquaux.info/science/tabicl-pretraining-the-best-tabular-learner.html#toc-entry-6"&gt;TabICL’s solution&lt;/a&gt;&lt;/h3&gt;

&lt;h4&gt;Row-wise encoding&lt;/h4&gt;

&lt;img src="https://gael-varoquaux.info/science/attachments/tabicl/tabicl_architecture.png"&gt;
&lt;p&gt;To break the quadratic cost, TabICL first encodes the rows to a
smaller, fixed-size representation, before performing across-row
in-context learning.&lt;/p&gt;

&lt;p morss_own_score="7.0" morss_score="13.0"&gt;For more scalability and better inductive bias, our model, TabICL, first
embeds the rows (using a first transformer) and then does in-context
learning across rows (with a second transformer). The resulting
computational complexity is &lt;em&gt;O(n p² + n²)&lt;/em&gt;, which is more scalable,
though still quadratic in &lt;em&gt;n&lt;/em&gt; and &lt;em&gt;p&lt;/em&gt;.&lt;/p&gt;
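&lt;p&gt;To get a feel for the difference between the two complexity formulas quoted in this post, the short sketch below evaluates them for a few table shapes. These are bare operation-count models: they drop constant factors, the number of layers, and the embedding dimensions, so only the ratio is meaningful. The function names are made up for this example.&lt;/p&gt;

```python
def tabpfnv2_attention_cost(n, p):
    """Cost model from the text for alternating attention: n p^2 + p n^2."""
    return n * p**2 + p * n**2


def tabicl_attention_cost(n, p):
    """Cost model from the text for row encoding then across-row attention: n p^2 + n^2."""
    return n * p**2 + n**2


# Compare the two cost models for tables with n rows and p columns.
for n, p in [(1_000, 10), (10_000, 100), (100_000, 100)]:
    ratio = tabpfnv2_attention_cost(n, p) / tabicl_attention_cost(n, p)
    print(f"n={n:>7,} p={p:>4}: ratio = {ratio:.1f}")
```

&lt;p&gt;The ratio simplifies to &lt;em&gt;p (n + p) / (n + p²)&lt;/em&gt;, so when &lt;em&gt;n&lt;/em&gt; is much larger than &lt;em&gt;p²&lt;/em&gt; it approaches &lt;em&gt;p&lt;/em&gt;: under these simplified models, the saving grows with the number of columns.&lt;/p&gt;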
&lt;p&gt;Scalability is important because it enables us to pretrain TabICL on both
small &lt;em&gt;and&lt;/em&gt; large datasets, and as a consequence TabICL is a good
predictor for large datasets.&lt;/p&gt;

&lt;br&gt;


&lt;div class="section" id="column-specific-embeddings" morss_own_score="6.0" morss_score="14.5"&gt;
&lt;h4&gt;Column-specific embeddings&lt;/h4&gt;

&lt;img src="https://gael-varoquaux.info/science/attachments/tabicl/tabicl_embeddings.png"&gt;
&lt;p&gt;To apply different transformations on columns depending on their
statistical properties, TabICL builds positional embeddings for
columns that capture aspects of their distribution.&lt;/p&gt;

&lt;p&gt;Another important innovation of TabICL is that it inputs the entries in
the transformer with column-specific embeddings. These column embeddings
are computed to be a function of the distribution of the column. For
this, we use a set transformer, which is a scalable transformer-like way
of building a function on sets (but without the quadratic complexity).&lt;/p&gt;
&lt;p&gt;After pretraining, we find that the column embeddings have learned a
mapping that implicitly captures statistical aspects of the data
distribution in the column, such as the kurtosis or the skewness.&lt;/p&gt;
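&lt;p&gt;The sketch below illustrates what "a function of the distribution of the column" can look like. It is deliberately not a set transformer: instead of a learned mapping, it hard-codes a permutation-invariant one by averaging fixed per-entry features over the column (a Deep Sets-style pooling), which happens to yield the skewness and kurtosis mentioned above. The name &lt;code&gt;column_embedding&lt;/code&gt; and the choice of four statistics are assumptions made for illustration.&lt;/p&gt;

```python
import math


def column_embedding(values):
    """Permutation-invariant summary of one column: (mean, std, skewness, kurtosis).

    Assumes a non-constant column, so that the standard deviation is not zero.
    """
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    # Per-entry features of the standardized value, pooled by averaging over the set.
    standardized = [(v - mean) / std for v in values]
    skewness = sum(z**3 for z in standardized) / n
    kurtosis = sum(z**4 for z in standardized) / n
    return (mean, std, skewness, kurtosis)


symmetric = [-2.0, -1.0, 0.0, 1.0, 2.0]
heavy_right_tail = [0.0, 0.1, 0.2, 0.3, 5.0]
print(column_embedding(symmetric))
print(column_embedding(heavy_right_tail))
# Reordering the entries leaves the embedding unchanged, up to rounding.
print(column_embedding(list(reversed(heavy_right_tail))))
```

&lt;p&gt;Because the pooling is an average over entries, the cost is linear in the number of rows and the result does not depend on row order, the two properties that make set functions attractive here. A learned set function can go beyond these four hand-picked moments.&lt;/p&gt;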
&lt;/div&gt;


&lt;div class="section" id="the-result-a-powerful-and-easy-to-use-tabular-learner" morss_own_score="4.920634920634921" morss_score="25.702453102453102"&gt;
&lt;h2&gt;&lt;a href="https://gael-varoquaux.info/science/tabicl-pretraining-the-best-tabular-learner.html#toc-entry-7"&gt;The result: a powerful and easy to use tabular learner&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;After a lot of pretraining on synthetic data, TabICL is a
state-of-the-art tabular learner. Pretraining gave it the right inductive
bias, as visible from the classifier-comparison plot below:&lt;/p&gt;

&lt;img src="https://gael-varoquaux.info/science/attachments/tabicl/tabicl_comparison.png"&gt;
&lt;p&gt;A classic classification comparison plot that shows the decision
boundaries on very simple toy data. It is useful to get a feeling of
how classifiers behave.&lt;/p&gt;

&lt;p&gt;It is interesting to see that while TabICL forms very flexible decision
boundaries, they do extend along the horizontal and vertical axes, as do
those of the decision tree and random forest. These axis-aligned features are a
very important aspect of the inductive bias.&lt;/p&gt;
&lt;p&gt;At the end of the day, TabICL is an excellent tabular learner, as visible
on benchmarks:&lt;/p&gt;

&lt;img src="https://gael-varoquaux.info/science/attachments/tabicl/result_comparison.png"&gt;
&lt;p&gt;TabICL is a great predictor: Comparison of many predictors.&lt;/p&gt;


&lt;img src="https://gael-varoquaux.info/science/attachments/tabicl/tabarena.png"&gt;
&lt;p&gt;Experimental results, from a benchmark paper independent of the TabICL
paper: TabArena &lt;a href="https://arxiv.org/abs/2506.16791"&gt;[Erickson et al, 2025]&lt;/a&gt;&lt;/p&gt;


&lt;br&gt;

&lt;p&gt;The benefit of TabICL over TabPFNv2 becomes more marked for larger datasets:&lt;/p&gt;

&lt;img src="https://gael-varoquaux.info/science/attachments/tabicl/tabicl_scale_bench.png"&gt;
&lt;p&gt;Rank (lower is better) as a function of dataset size.&lt;/p&gt;

&lt;p&gt;However, one limitation to keep in mind is that with in-context learners,
such as TabICL or TabPFN, inference (prediction on new data points) can be
costly.&lt;/p&gt;

&lt;br&gt;

&lt;p&gt;All in all, TabICL is an excellent tabular predictor, and a push forward
for tabular foundation models. From a fundamental standpoint, it shows
that in-context learning is not only for few-shot learning, but that it can be
very beneficial for sample sizes as large as &lt;em&gt;n=100 000&lt;/em&gt;.&lt;/p&gt;

&lt;br&gt;


&lt;p&gt;More about TabICL&lt;/p&gt;
&lt;p&gt;There is a lot more in TabICL: the details of pretraining are crucial,
and the implementation uses memory offloading (facilitated by the
architecture, which dissociates the train X from the test y for most
of the operations). To learn more about TabICL:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The paper: &lt;a href="https://arxiv.org/abs/2502.05564"&gt;https://arxiv.org/abs/2502.05564&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;The GitHub code: &lt;strong&gt;TabICL is 100% open source&lt;/strong&gt;
&lt;a href="https://github.com/soda-inria/tabicl"&gt;https://github.com/soda-inria/tabicl&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Install the Python package: TabICL is just one pip install away
&lt;a href="https://pypi.org/project/tabicl/"&gt;https://pypi.org/project/tabicl/&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;br&gt;


&lt;p&gt;Other topics in table foundation models: leveraging strings&lt;/p&gt;
&lt;p&gt;TabICL is only one aspect of table foundation models. We are also
pursuing another line of research that focuses on using strings (in
entries and column names) to bring knowledge about the real world into
table foundation models, see &lt;a href="https://gael-varoquaux.info/science/carte-toward-table-foundation-models.html"&gt;CARTE&lt;/a&gt; and more recently &lt;a href="https://arxiv.org/abs/2505.14415"&gt;[Kim
et al, 2025]&lt;/a&gt;.&lt;/p&gt;

&lt;/div&gt;



&lt;/div&gt;
</ns0:encoded><pubDate>Wed, 09 Jul 2025 00:00:00 </pubDate></item></channel></rss>