What is new about Rust (an introduction to its basic language features)

Rust is a programming language developed by Mozilla, the maker of Firefox, and used to build its next-generation browser. By introducing the concept of borrow checking, the compiler guarantees memory safety and data-race safety. It has been attracting growing attention since around the release of its stable version in mid-2015.

Memory safety means a state free of out-of-bounds memory accesses, double frees, null references, and accesses to uninitialized memory. Note, however, that the memory safety Rust talks about does not guarantee the absence of memory leaks.
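
The point about leaks can be seen in a few lines of safe code. The following is only a minimal sketch and is not taken from the original article: Node and make_cycle are hypothetical names, and the cycle is built with the standard Rc (reference counter) and RefCell types. The two nodes point at each other, so neither reference count ever reaches zero and the memory is never freed, yet no invalid access occurs.

```rust
use std::cell::RefCell;
use std::rc::Rc;

// Hypothetical node type that can point at another node.
struct Node {
    next: RefCell<Option<Rc<Node>>>,
}

// Builds two nodes that reference each other and returns the strong
// reference count of the first node just before the local handles are dropped.
fn make_cycle() -> usize {
    let a = Rc::new(Node { next: RefCell::new(None) });
    let b = Rc::new(Node { next: RefCell::new(Some(Rc::clone(&a))) });
    *a.next.borrow_mut() = Some(Rc::clone(&b));
    // `a` is held by the local variable and by `b.next`, so the count is 2.
    // When the locals go out of scope each count drops to 1, never to 0,
    // so both nodes are leaked without any unsafe code.
    Rc::strong_count(&a)
}
```

Calling make_cycle() returns 2, and the two nodes stay allocated until the process exits.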

Data-race safety means that a single object is never read from and written to at the same time in a way that makes the result undefined. This is different from a race condition.

Programming languages influence one another as they evolve, just as anonymous functions were adopted by one language after another. Rust introduced the idea of managing memory by including a "lifetime" in a variable's type, and it has been proven that this approach can achieve several kinds of soundness, including memory safety. It may therefore influence other languages (especially C++) in the future. I wrote this article in the hope that even people who have never used Rust can understand its philosophy without writing any Rust, and can use that understanding to think about how it might influence other languages.

Rust's position as a programming language

Rust is an imperative language that adopts many features that have become standard in functional languages and is designed with use in system software in mind. Because the programmer can control all memory management, it holds up in constrained environments such as embedded systems, and it also allows execution-speed optimization on par with C.

A functional language is a programming language that avoids side effects (rewriting the contents of variables). Functional languages have a strong affinity with language research, and various concepts have been tried and incorporated on top of functional languages. Concepts such as anonymous functions have already been incorporated into many imperative languages, but Rust takes in even more of them.

System software refers to software in the domain closer to the system or hardware than to applications. However, the boundary between application software and system software is not clear-cut; for example, a browser has aspects of system software when viewed from a web application, but aspects of application software when viewed from the OS.

An imperative language is a programming language that makes you strongly aware of the order of computation. Because computer hardware implementations are imperative, many languages belong to the imperative category. Rust and Scala have elements of both functional and imperative languages and the boundary is fuzzy, but Scala tends to have you write iteration as recursion and is in many cases classified as a functional language, whereas Rust does not implement tail-call optimization and uses continue and break in loops as standard, so this article classifies it as an imperative language.

Rust can prevent, at compile time, the data races that you must always watch out for when parallelizing code. Python is often used as a supporting language for machine learning and scientific computing, but now that per-core performance is beginning to plateau, scripting languages that are hard to parallelize have little room left to get faster through single-threaded optimization alone. Languages that parallelize well are expected to spread, and Rust, in which parallelization bugs are less likely to occur, is expected to attract even more attention.

To remain thread-safe, many scripting languages allow only one thread to execute instructions at a time, which keeps non-thread-safe operations from running concurrently and keeps data races from corrupting variables. In Python this mechanism is called the Global Interpreter Lock, and it is what makes parallelization difficult.

Also in recent years, statically typed languages have been drawing attention in the web industry too, with a proliferation of languages that transcompile to JavaScript. Rust has begun supporting an execution format called WebAssembly for fast computation on web browsers, and it may increasingly be chosen as one of the options going forward.

Transcompilation refers to converting not to machine code or a language close to it, but to a high-level language. Today the only language web browsers can interpret is JavaScript, but because JavaScript has many problems, languages that compile to JavaScript, such as CoffeeScript, JSX, and TypeScript, are on the rise to make up for them.

WebAssembly is an intermediate language created to achieve computation speed comparable to compiled languages in the browser. Given that purpose, C++ is often used to write code that targets it, but Rust is expected to make up for C++'s weaknesses while giving up almost no execution speed.

The history of memory management and Rust's solution

Until smart pointers were born

In C, dynamic memory had to be allocated and freed explicitly, at the programmer's responsibility. However, it is hard for humans to free every allocation without ever forgetting one, and from around 1990 languages in which the runtime manages memory became mainstream. Yet runtime memory management has its own problems, such as "the program pauses while memory is reclaimed" and "the runtime becomes complex", so when memory efficiency or execution efficiency mattered there were still situations in which programs had to allocate and free memory manually in C or C++.

There are implementation methods for runtime memory management that stop the program and ones that do not, but the methods that stop the program are used often because they are simple to implement. Techniques that emphasize real-time properties have also been researched, but they have the problem that the implementation becomes more complex and overall execution efficiency tends to decline.

To improve that situation, C++, in which the point at which a destructor runs is well defined, introduced smart pointers that acquire memory in the constructor and free it in the destructor. The smart pointer unique_ptr has no memory overhead compared with allocating and freeing through a raw pointer. Moreover, in ordinary business programming such as web-service development, unique_ptr is sufficient in most situations, so there are now almost no cases in which memory must be freed explicitly. This showed that, in many situations, memory can be managed without memory overhead and without forgetting to free it.

A destructor is a method that is called when an object is destroyed. It corresponds to the finalize method in Java, but in Java its use is not recommended because there is no guarantee that an object will be destroyed immediately. On the other hand, in C++ the timing when the destructor is called is determined (for a local variable, when it leaves scope), which made it possible to use it for smart pointers.

unique_ptr is a container for a raw pointer. delete must be called exactly once on a raw pointer allocated with new, but when a function branches it is hard to make sure that delete is called on every path without forgetting one.

void UserFunction(int value) {
  std::string* s = new std::string();
  if (value == 0) {
    // You must always call delete before returning.
    delete s;
    return;
  }
  delete s;
}

So, by taking advantage of the fact that the destructor of a local variable is always called when it leaves scope, it became possible to prevent forgetting to call delete.

void UserFunction(int value) {
  std::unique_ptr<std::string> s(new std::string());
  if (value == 0) {
    // When you return, the destructor of unique_ptr is called,
    // and unique_ptr calls delete inside its destructor.
    return;
  }
}

As the Google C++ Style Guide says, "Do not design your code to use shared ownership without a very good reason." If you design well, unique_ptr is sufficient in almost all cases.

Problems with unique_ptr

The arrival of unique_ptr eliminated forgotten frees, but one problem remains: if you transfer ownership (the right to free) of the pointer held by one unique_ptr to another unique_ptr, the original unique_ptr is reset to null, and dereferencing it afterward is undefined behavior that typically shows up as a memory access violation.

// C++
#include <iostream>
#include <memory>
#include <string>
#include <utility>

int main() {
  // This is equivalent to std::unique_ptr<std::string> hoge(new std::string("hoge"));
  // It avoids writing the type name twice and removes new from the code.
  auto hoge = std::make_unique<std::string>("hoge");
  auto piyo = std::make_unique<std::string>("piyo");
  std::cout << *hoge << std::endl;  // => hoge
  std::cout << *piyo << std::endl;  // => piyo
  hoge = std::move(piyo);  // Ownership of the pointer in piyo moves to hoge.
  std::cout << *hoge << std::endl;  // => piyo
  std::cout << *piyo << std::endl;  // piyo is now null, so this is undefined behavior (typically a crash).
}

Transfer (move semantics)

C++11 introduced move semantics, making it possible to transfer (move) values of various types, including strings. However, a move has to be written explicitly, and if you forget to write it a copy is made instead.

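// C++
#include <iostream>
#include <string>
#include <utility>

int main() {
  std::string hoge = "hoge";
  std::string piyo = "piyo";
  std::cout << hoge << std::endl;  // => hoge
  std::cout << piyo << std::endl;  // => piyo
  // Writing hoge = piyo; causes a copy of the string!
  hoge = std::move(piyo);  // The contents of piyo are moved into hoge.
  std::cout << hoge << std::endl;  // => piyo
  std::cout << piyo << std::endl;  // => typically an empty string (the moved-from state is valid but unspecified)
}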

Rust's solution

Rust made the assignment operator "=" perform a move unless the type is explicitly specified as copyable. It also made access to a moved-from variable forbidden at compile time.

fn main() {
  let mut hoge = "hoge".to_string();  // put the string "hoge" into the variable hoge
  let piyo = "piyo".to_string();  // put the string "piyo" into the variable piyo
  println!("{}", hoge);  // => hoge
  println!("{}", piyo);  // => piyo
  hoge = piyo;  // ownership of the string in piyo is moved to hoge
  println!("{}", hoge);  // => piyo
  // println!("{}", piyo);  // uncommenting this causes a compile error
}

In the code above, if you uncomment line 8, you end up using the variable piyo that was moved on line 6, so the Rust compiler detects the problem and outputs a compile error like the one below. This prevents, at compile time, both accidentally copying a value and accessing an already-moved value.

error[E0382]: use of moved value: `piyo`
 --> test.rs:8:18
  |
6 |   hoge = piyo;
  |          ---- value moved here
7 |   println!("{}", hoge);
8 |   println!("{}", piyo);
  |                  ^^^^ value used here after move
  |
  = note: move occurs because `piyo` has type `std::string::String`, which does not implement the `Copy` trait
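
The note in the error message refers to the Copy trait, which is how a type is marked as copyable. As a minimal sketch (the Point type and both functions are hypothetical and not part of the original article), the following contrasts a type that opts in to copying with String, which has to be cloned explicitly if both values are needed.

```rust
// Hypothetical type that opts in to copying by deriving `Copy`.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Point {
    x: i32,
    y: i32,
}

// Returns both values to show that `p` is still usable after `let q = p;`.
fn copy_point() -> (Point, Point) {
    let p = Point { x: 1, y: 2 };
    let q = p; // `Point` is `Copy`, so this copies and `p` stays valid.
    (p, q)
}

// `String` is not `Copy`, so keeping both values requires an explicit clone.
fn clone_string() -> (String, String) {
    let hoge = "hoge".to_string();
    let piyo = hoge.clone(); // explicit, visible copy of the heap data
    (hoge, piyo)
}
```

Because the expensive copy has to be spelled out as clone(), it cannot happen by accident the way it can with the C++ assignment shown earlier.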

Variable lifetimes

Problems with passing by reference or pointer in C++

In C++, passing a variable by reference or by pointer avoids copying its value. However, the language does not guarantee that the referent is still alive when it is accessed, so it is possible to access a variable that has already been freed, which is a common source of bugs.

// C++
#include <stdio.h>
#include <string>

int main() {
  std::string* a;
  {
    std::string b = "hoge";
    a = &b;
    // The lifetime (scope) of variable b ends here, and the string hoge is freed here.
  }
  printf("%s\n", a->c_str());  // Invalid access!
}

Rust's solution

Rust introduced the concept of borrowing (passing a reference and guaranteeing its lifetime), making it impossible to access an already-freed variable (one that has left scope).

fn main() {
  let a;
  {
    let b = "hoge".to_string();
    a = &b;  // Compile error!
  }
  println!("{}", a);
}

In the code above, line 5 assigns a reference to b into a. However, a lives from line 2 to line 8, while b lives only from line 4 to line 6, which is shorter than the lifetime of the assignment target. In other words, if the reference could be assigned to a, b might be accessed after it has been freed, so Rust detects this and outputs a compile error like the one below. This guarantees that a reference always points to a live value.

error: `b` does not live long enough
 --> test.rs:5:10
  |
5 |     a = &b;
  |          ^ does not live long enough
6 |   }
  |   - borrowed value only lives until here
7 |   println!("{}", a);
8 | }
  | - borrowed value needs to live until here
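
For comparison, here is one way the same intent could be written so that it compiles. This is only a sketch with hypothetical function names: the value is either returned by value (moving ownership to the caller) or borrowed only while its owner is still in scope.

```rust
// Returns the string by value, so ownership moves to the caller and
// no reference to a dead local variable is ever created.
fn make_greeting() -> String {
    let b = "hoge".to_string();
    b
}

// Borrowing is fine when the referent outlives the reference.
fn first_char(s: &str) -> Option<char> {
    s.chars().next()
}

fn borrow_demo() -> Option<char> {
    let b = make_greeting(); // `b` lives until the end of `borrow_demo`
    let a = &b; // `a` is used only while `b` is alive
    first_char(a)
}
```

Here borrow_demo() returns Some('h'), and the compiler accepts it because a never outlives b.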

Borrow checking (mutability and references)

If a variable borrowed for reading could be rewritten at the same time, various problems would follow: across threads a data race would occur and the code would no longer be thread-safe, and even on a single thread it would become hard to guarantee that iterators stay valid. So Rust restricts the borrows of a value to either "any number of immutable (read-only) references" or "exactly one mutable (writable) reference" at a time.

fn f1(a: &String, b: &String) {}
fn f2(a: &mut String, b: &String) {}
fn f3(a: &mut String) {}

fn main() {
  let mut a = "hoge".to_string();  // declaration of a mutable variable
  f1(&a, &a);  // OK because only immutable references are borrowed
  f2(&mut a, &a);  // Compile error!
  f3(&mut a);  // OK because only one mutable reference is borrowed
  println!("{}", a);
}

In the code above, lines 7 and 9 compile, but line 8 does not. Line 8 borrows an immutable reference while a mutable reference to the same variable is still borrowed, so the compiler outputs an error like the one below. This guarantees that reading and writing a borrowed variable cannot cause a data race.

error[E0502]: cannot borrow `a` as immutable because it is also borrowed as mutable
 --> test.rs:8:15
  |
8 |   f2(&mut a, &a);
  |           -   ^- mutable borrow ends here
  |           |   |
  |           |   immutable borrow occurs here
  |           mutable borrow occurs here
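
When a function like f2 really does need to read and write the same data, the code has to be arranged so that the two borrows do not overlap. The following sketch shows two common ways to do that; the function names are hypothetical and the helper takes &str rather than &String purely for brevity.

```rust
// Hypothetical helper with a shape similar to `f2` above.
fn append(target: &mut String, source: &str) {
    target.push_str(source);
}

// Appends `s` to itself. The snapshot is an independent value, so the
// mutable borrow of `s` is the only borrow of `s` during the call.
fn double(s: &mut String) {
    let snapshot = s.clone();
    append(s, &snapshot);
}

// Borrows that do not overlap in time are also accepted.
fn len_then_push(s: &mut String) -> usize {
    let len = s.len(); // the immutable borrow ends here
    s.push('!'); // so a mutable borrow is allowed now
    len
}
```

Cloning costs an extra allocation, so in practice the second style (finish reading before writing) is usually preferable when it is possible.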

Lifetimes and binding of return values

In Rust you can return a reference as a return value.

// A function that returns the part of orig after prefix if orig starts with
// prefix, and returns orig unchanged otherwise.
// 'a denotes a lifetime, and indicates that orig and the return value have the same lifetime.
fn maybe_remove_prefix<'a>(orig: &'a [i32], prefix: &[i32]) -> &'a [i32] {
  if orig.starts_with(prefix) {
    return orig.split_at(prefix.len()).1;
  } else {
    return orig;
  }
}

fn main() {
  let mut a = [1, 2, 3];
  let mut suffix;
  {
    let b = [1, 2];
    suffix = maybe_remove_prefix(&a, &b);
    // suffix = maybe_remove_prefix(&b, &a);
    // a = [4, 5, 6];
  }
  println!("{:?}", suffix);  // => [3]
}

In the code above, the lifetime of the reference returned by maybe_remove_prefix is declared, using the lifetime parameter 'a, to be the same as that of orig. The annotation is needed because there are two reference arguments and the compiler does not guess which one the result is derived from. Since a reference with a shorter lifetime cannot be assigned to suffix, uncommenting line 18 outputs a compile error like the one below. This guarantees that the returned reference has a valid lifetime.

error: `b` does not live long enough
  --> test.rs:18:35
   |
18 |     suffix = maybe_remove_prefix(&b, &a);
   |                                   ^ does not live long enough
19 |     // a = [4, 5, 6];
20 |   }
   |   - borrowed value only lives until here
21 |   println!("{:?}", suffix);  // => [3]
22 | }
   | - borrowed value needs to live until here

When a reference is used as a return value, while the return value is in use the referent is bound to the return value and cannot be changed. Uncommenting line 19 would rewrite the variable a that is borrowed by suffix, so a compile error like the one below is output. This guarantees that the referent of the returned reference is not rewritten and remains valid.

error[E0506]: cannot assign to `a` because it is borrowed
  --> test.rs:19:5
   |
17 |     suffix = maybe_remove_prefix(&a, &b);
   |                                   - borrow of `a` occurs here
18 |     // suffix = maybe_remove_prefix(&b, &a);
19 |     a = [4, 5, 6];
   |     ^^^^^^^^^^^^^^^^^^^^^ assignment to borrowed `a` occurs here
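
In maybe_remove_prefix only orig can be returned, so only orig carries 'a. When either argument may be returned, both have to share the lifetime parameter. The following is a minimal sketch of that case; longer and lifetime_demo are hypothetical names, not functions from the original article.

```rust
// Either argument may be returned, so both share the lifetime `'a`
// and the result is valid only while both inputs are alive.
fn longer<'a>(x: &'a [i32], y: &'a [i32]) -> &'a [i32] {
    if x.len() >= y.len() { x } else { y }
}

fn lifetime_demo() -> Vec<i32> {
    let a = [1, 2, 3];
    let b = [4, 5];
    // Both `a` and `b` outlive `result`, so this compiles.
    let result = longer(&a, &b);
    result.to_vec()
}
```

If b were declared in an inner block that ends before result is used, the compiler would reject the call for the same reason as in the error shown above.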

Rust's language features

Enums that hold values

In many modern languages an enum is a type from which you can select one of several identifiers each assigned an integer, but Rust's enum is a type from which you can select one of several values (including structs and tuples), also known as a tagged union.

enum Option<T> {
  None,
  Some(T),
}

For example, the enum Option<T> frequently used in Rust is defined as in the code above, taking either None or Some(T). None holds no value, but Some(T) holds a value of type T. It is used, for example, when retrieving a value from an associative array, so that "whether a value exists or not" and "the value when it exists" can be expressed as a single value.
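
The associative-array case can be sketched with the standard HashMap, whose get method returns an Option. The stock table and the describe_stock function are hypothetical and serve only as an illustration.

```rust
use std::collections::HashMap;

// Looks up a (hypothetical) stock table and reports the result as text.
fn describe_stock(stock: &HashMap<&str, u32>, item: &str) -> String {
    match stock.get(item) {
        // `get` returns Option<&u32>: Some holds a reference to the value.
        Some(count) => format!("{}: {} in stock", item, count),
        // None carries no value at all.
        None => format!("{}: unknown item", item),
    }
}
```

Because the value can only be reached through the Some case, there is no way to read it without first handling the case where it is missing.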

Pattern matching

Instead of the switch construct found in many languages, Rust has the match construct. match is far more expressive than switch: for example, it can also branch on the values held by an enum.

fn main() {
  let string_value = "1234";
  // parse::<i64>() returns the enum Result<i64, ParseIntError>.
  // Result<T, E> is an enum consisting of Ok(T) and Err(E).
  match string_value.trim().parse::<i64>() {
    Ok(x) if x < 0 => println!("negative integer: {}", x),
    Ok(x) => println!("non-negative integer: {}", x),
    Err(e) => println!("failed to convert to an integer: {}", e),
  }
}

The compiler checks whether the patterns are exhaustive. For example, if you delete line 7 of the code above, a compile error like the one below is output.

error[E0004]: non-exhaustive patterns: `Ok(_)` not covered
 --> <anon>:5:9
  |
5 |   match string_value.trim().parse::<i64>() {
  |         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ pattern `Ok(_)` not covered

Generics

Rust has generics, which let you pass types as compile-time parameters. If you want to perform an operation on a value of such a type, you must restrict the type in advance by naming the trait (the set of behavior the type must implement) that provides that operation.

Generics refers to the <T> part of List<T> in Java, and denotes the feature of providing types as compile-time parameters. C++ templates similarly provide types as compile-time parameters, but they restrict types at compile time via duck typing (based on whether the called method exists or not).

use std::ops::AddAssign;

// The type T must implement the Clone trait (x.clone()) and the
// AddAssign<T> trait (result += value).
fn sum<T: Clone + AddAssign<T>>(v: &[T], init: T) -> T {
  let mut result = init;
  for x in v {
    result += x.clone();
  }
  result
}

fn main() {
  let v = vec![10, 20, 30, 40];
  // Calls sum::<i32>.
  println!("{}", sum(&v, 0));  // => 100

  let v = vec![0.01, 0.02, 0.03, 0.04];
  // Calls sum::<f64>.
  println!("{}", sum(&v, 0.0));  // => 0.1
}
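
Trait bounds are not limited to operator traits. As another small sketch (largest is a hypothetical function), the bound below requires comparison and copying, and the body may use only what those two traits provide.

```rust
// Returns the largest element, or None for an empty slice.
// `PartialOrd` permits `>` and `Copy` permits returning elements by value.
fn largest<T: PartialOrd + Copy>(items: &[T]) -> Option<T> {
    let mut iter = items.iter();
    let mut best = *iter.next()?;
    for &item in iter {
        if item > best {
            best = item;
        }
    }
    Some(best)
}
```

Unlike a C++ template, a missing capability is reported against the bound in the signature rather than somewhere inside the instantiated body.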

Defining operators for user-defined types

Rust lets you define operators (such as + and *) for enums and structs.

use std::ops::Add;

#[derive(Debug)]
struct Point(i32, i32);

impl Add for Point {
  type Output = Point;

  fn add(self, rhs: Point) -> Point {
    Point(self.0 + rhs.0, self.1 + rhs.1)
  }
}

fn main() {
  let result = Point(1, 2) + Point(3, 4);
  println!("{:?}", result);  // => Point(4, 6)
}
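
The operand types do not have to match, and unary operators work the same way. The following sketch uses a hypothetical Vec2 type to define multiplication by an integer through Mul<i32> and unary minus through Neg.

```rust
use std::ops::{Mul, Neg};

#[derive(Debug, PartialEq, Clone, Copy)]
struct Vec2(i32, i32);

// `Vec2 * i32`: the right-hand side type is the trait's type parameter.
impl Mul<i32> for Vec2 {
    type Output = Vec2;

    fn mul(self, rhs: i32) -> Vec2 {
        Vec2(self.0 * rhs, self.1 * rhs)
    }
}

// Unary minus is defined through the `Neg` trait.
impl Neg for Vec2 {
    type Output = Vec2;

    fn neg(self) -> Vec2 {
        Vec2(-self.0, -self.1)
    }
}
```

With these definitions, Vec2(1, 2) * 3 evaluates to Vec2(3, 6) and -Vec2(1, -2) evaluates to Vec2(-1, 2).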

Reference counting and Mutex

Rust provides reference-counted pointers (Rc and Arc) that allow an object to be shared by multiple owners. Arc is the thread-safe one, and combining it with Mutex makes it possible to write to a value from multiple threads. In many languages a mutex is declared separately from the data it protects, but Rust makes the protected value explicit by expressing Mutex as a container that holds it.

use std::thread;
use std::sync::{Arc, Mutex};

fn main() {
  let data = Arc::new(Mutex::new(0));
  let mut threads = vec![];
  for thread_id in 0..3 {
    println!("Starting thread (ID: {})", thread_id);
    let data = data.clone();
    threads.push(thread::spawn(move || {
      for _ in 0..100000 {
        *data.lock().unwrap() += 1;
      }
      println!("Thread (ID: {}) finished", thread_id);
    }));
  }

  for t in threads {
    t.join().unwrap();
  }

  println!("Total: {}", *data.lock().unwrap());
}
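
For the single-threaded case, Rc is typically paired with RefCell instead of Mutex. The sketch below (shared_counter is a hypothetical function) shares one counter between two owners; RefCell enforces the "one mutable borrow" rule at run time instead of at compile time.

```rust
use std::cell::RefCell;
use std::rc::Rc;

// Two owners share one counter on a single thread and both update it.
fn shared_counter() -> (i32, usize) {
    let counter = Rc::new(RefCell::new(0));
    let other_owner = Rc::clone(&counter);

    *counter.borrow_mut() += 1;
    *other_owner.borrow_mut() += 10;

    // Both handles see the same value; two owners exist at this point.
    let value = *counter.borrow();
    (value, Rc::strong_count(&counter))
}
```

Rc does not use atomic operations, so it cannot be sent to another thread; trying to move it into thread::spawn is rejected by the compiler, which is why the threaded example above uses Arc.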

Variable initialization

Rust requires every variable to be initialized before it is used, but not necessarily at its declaration.

fn user_function(flag: bool) {
  // result is not initialized at declaration, but it is assigned exactly once on
  // every path and can be interpreted as a non-rewritable variable, so it is not a compile error.
  let result: String;
  if flag {
    result = "hoge".into();
  } else {
    result = "piyo".into();
  }
  println!("{}", result);
}

If you delete line 8 of the code above, it can no longer be guaranteed that the variable result is initialized, so a compile error like the one below is output.

error[E0381]: use of possibly uninitialized variable: `result`
  --> <anon>:10:18
   |
10 |   println!("{}", result);
   |                  ^^^^^^ use of possibly uninitialized `result`
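
When every branch simply produces the value, the same logic can also be written with if used as an expression, so that the variable is initialized at its declaration. This is only an alternative sketch with a hypothetical function name.

```rust
fn pick_name(flag: bool) -> String {
    // `if` is an expression, so both branches must produce a value
    // and `result` is initialized at its declaration.
    let result: String = if flag { "hoge".into() } else { "piyo".into() };
    result
}
```

Leaving out the else branch here is a compile error as well, because the expression would have no value when flag is false.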

Powerful type inference

Rust has powerful type inference: a variable's type does not have to be stated explicitly as long as it can be determined, not only from the initializer at its declaration but also from a later expression.

fn user_function(flag: bool) {
  // The type of result is not declared here, but it is ultimately inferred.
  // result is not initialized, but it is assigned exactly once on every path
  // and can be interpreted as a non-rewritable variable, so it is not a compile error.
  let result;
  if flag {
    // The return value of to_string() is of type String,
    // so the type of result is inferred to be String.
    result = "hoge".to_string();
  } else {
    // The return value of into() depends on the type of result, so it is not used for inference.
    // Since the type of result is inferred to be String,
    // into calls the type conversion from &str to String.
    result = "piyo".into();
  }
  println!("{}", result);
}
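
The same kind of inference applies to the element types of containers. In the sketch below (both functions are hypothetical), the element type of a Vec is never written at the declaration and is determined from how the variable is used later.

```rust
fn collect_lengths(words: &[&str]) -> Vec<usize> {
    // No element type is written here; it is inferred from later use.
    let mut lengths = Vec::new();
    for word in words {
        lengths.push(word.len()); // `len()` returns usize, so Vec<usize>
    }
    lengths
}

fn parse_all(inputs: &[&str]) -> Vec<i64> {
    // `_` asks the compiler to fill in the element type, which it takes
    // from the function's return type. That in turn selects parse::<i64>.
    let parsed: Vec<_> = inputs.iter().filter_map(|s| s.parse().ok()).collect();
    parsed
}
```

In parse_all, inputs that fail to parse are dropped by filter_map, so only the valid integers remain.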

Reusing variable names

In Rust you define variables with let, and by using let again you can redefine the same variable name, even with a different type. This reduces the situations in which you need mutable variables.

fn main() {
  // data is an immutable &str
  let data = "foo,bar";
  // data is an immutable Vec<&str>
  let data = data.split(",").collect::<Vec<_>>();
  // data is a mutable String
  let mut data = data.join(" ");
  data += " baz";
  println!("{}", data);  // => foo bar baz
}

Error handling

In Rust it is common to use Result<T, E> as the return type of a function that can fail. Checking every error with an if statement, as in Go, makes the code verbose, but in Rust the postfix ? operator returns the error to the caller, which keeps the code simple.

use std::fs::OpenOptions;
use std::io::Write;

// If there is no value to return as the Result type, you can use the empty tuple type.
fn write_foo() -> Result<(), std::io::Error> {
  // A file object automatically closes the file when it is destroyed,
  // so you never forget to close it.
  // open returns Result<File, std::io::Error>, but the postfix ? operator
  // unwraps it to File (and returns early on error).
  // write_all also returns a Result, so it needs ? as well.
  OpenOptions::new().write(true).create(true).open("/tmp/foo")?.write_all(b"foo")?;
  Ok(())
}

fn user_function() -> Result<(), String> {
  // To change the error type, you can convert it with the map_err method.
  write_foo().map_err(|err| format!("Failed to write foo: {}", err))?;
  Ok(())
}
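
To make the effect of the ? operator concrete, the following sketch writes the same function twice, once with explicit error checks and once with ?. Both function names are hypothetical, and integer parsing is used instead of file I/O only to keep the example free of side effects.

```rust
use std::num::ParseIntError;

// Explicit version: every error is checked and returned by hand.
fn add_explicit(a: &str, b: &str) -> Result<i64, ParseIntError> {
    let x = match a.parse::<i64>() {
        Ok(value) => value,
        Err(err) => return Err(err),
    };
    let y = match b.parse::<i64>() {
        Ok(value) => value,
        Err(err) => return Err(err),
    };
    Ok(x + y)
}

// The same logic with the postfix `?` operator.
fn add_with_question_mark(a: &str, b: &str) -> Result<i64, ParseIntError> {
    Ok(a.parse::<i64>()? + b.parse::<i64>()?)
}
```

Both functions return Ok(3) for the inputs "1" and "2" and the same Err for an input that is not an integer.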

Hygienic macros

Rust provides a macro feature to make up for what cannot be achieved with other language features. Unlike generics, it does not restrict the types of arguments, so it allows flexible expressions that functions alone cannot. There are also many built-in macros, such as println! and vec!.

macro_rules! five_times {
  ($x:expr) => (5 * $x);
}

fn main() {
  println!("{}", five_times!(2 + 3));  // => 25
}

Writing the same macro in C would expand to 5 * 2 + 3 and print 13, but Rust expands it to 5 * (2 + 3) without breaking the structure of the expression, so the code above prints 25.
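
As an illustration of something a plain function cannot express, the sketch below defines a hypothetical max_of! macro that accepts any number of arguments of any comparable type. It is an example written for this explanation, not a macro from the standard library.

```rust
macro_rules! max_of {
    // Base case: a single expression is its own maximum.
    ($x:expr) => ($x);
    // Recursive case: compare the first expression with the maximum of the rest.
    ($x:expr, $($rest:expr),+) => {{
        let first = $x;
        let others = max_of!($($rest),+);
        if first > others { first } else { others }
    }};
}
```

For example, max_of!(3, 9, 4) evaluates to 9. The names first and others inside the macro cannot collide with variables of the same name at the call site, which is what "hygienic" refers to.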

Also, Rust 1.15 (released on February 2, 2017) added proc_macro, which makes it possible to receive Rust code as input and produce new code from it. This further increases flexibility: for example, features such as serialization and deserialization based on struct field names can be implemented without modifying the compiler itself.

Interoperability with C

Rust aims to be a programming language that can also write system software. Because it manages memory explicitly and has no problems such as memory address changes caused by garbage collection, its interoperability with C is clear and easy.
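
As a rough sketch of what this looks like in code, the following declares a struct with a C-compatible layout and a function that uses the C calling convention. CPoint and cpoint_add are hypothetical names; to make the function callable from C under that exact symbol name, a no_mangle attribute and a suitable library build setup would also be needed, which are omitted here.

```rust
// `#[repr(C)]` asks for the same field layout a C compiler would use
// for `struct CPoint { int32_t x; int32_t y; };`.
#[repr(C)]
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct CPoint {
    pub x: i32,
    pub y: i32,
}

// `extern "C"` selects the C calling convention for this function.
// Hypothetical function; exporting it under this exact symbol name
// would additionally require a `no_mangle` attribute.
pub extern "C" fn cpoint_add(a: CPoint, b: CPoint) -> CPoint {
    CPoint { x: a.x + b.x, y: a.y + b.y }
}
```

The function remains an ordinary Rust function as well, so it can be called and tested from Rust code without any C toolchain.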

Conclusion

To learn Rust, I tried writing a Lisp interpreter in it. References in Rust are subject to restrictions on lifetimes and mutability, and when I started from a sloppy design I kept running into places where mutability was missing. Compared with conventional imperative languages, these restrictions affect a much wider context than intuition suggests, so I was somewhat bewildered at having to rewrite code in many places. However, it is also true that fixing the code to satisfy the restrictions made the overall design cleaner, and the result may well be relatively readable code no matter who writes it.

In addition, memory safety without overhead is very attractive, and Rust also has modern language features such as powerful type inference. Mozilla currently uses it for browser development, but it is also becoming usable with WebAssembly, and in OSes and embedded systems it has a big advantage in memory efficiency and safety, so I firmly believe it will come to be used in a wide range of fields.

Acknowledgments

The draft of this article was reviewed by @tanakh, @Linda_pp, and @ogiekako. Thank you very much.
